89 research outputs found

    CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

    Full text link
    Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class while misclassifying the minority class. In real-life applications, however, the minority class instances represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling the majority class and over-sampling the minority class), cost-sensitive learning, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with boosting (AdaBoost), called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of CUSBoost against state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets. Comment: CSITSS-201
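    As a rough illustration of the cluster-based under-sampling idea described above, the sketch below clusters the majority class with k-means, keeps a fraction of every cluster, and fits AdaBoost on the reduced training set. This is a minimal sketch rather than the authors' implementation (in CUSBoost, as in RUSBoost, the resampling happens inside each boosting round); the function name, cluster count, and sampling fraction are illustrative assumptions.

```python
# Minimal sketch of cluster-based under-sampling followed by AdaBoost.
# Not the authors' reference implementation; n_clusters and keep_frac
# are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def cluster_undersample(X, y, majority_label, n_clusters=5, keep_frac=0.5, seed=0):
    """Under-sample the majority class by clustering it with k-means and
    keeping a fraction of every cluster, so the retained subset still
    covers the majority-class distribution."""
    rng = np.random.default_rng(seed)
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]

    clusters = KMeans(n_clusters=n_clusters, n_init=10,
                      random_state=seed).fit_predict(X[maj_idx])

    kept = []
    for c in range(n_clusters):
        members = maj_idx[clusters == c]
        n_keep = max(1, int(keep_frac * len(members)))
        kept.append(rng.choice(members, size=n_keep, replace=False))

    sel = np.concatenate(kept + [min_idx])
    return X[sel], y[sel]

# Usage (illustrative): balance the training set once, then boost on it.
# X_bal, y_bal = cluster_undersample(X_train, y_train, majority_label=0)
# clf = AdaBoostClassifier(n_estimators=100).fit(X_bal, y_bal)
```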

    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Full text link
    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
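    A minimal usage sketch of the k-means SMOTE idea, assuming the imbalanced-learn package (which ships a KMeansSMOTE sampler) is installed; the synthetic dataset, the cluster balance threshold, and the classifier below are illustrative and may need tuning for a given problem.

```python
# Hedged usage sketch of k-means SMOTE via imbalanced-learn's KMeansSMOTE.
# Assumes imbalanced-learn is installed; parameter values are illustrative,
# not the paper's settings.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Cluster the input space, then apply SMOTE only inside clusters with enough
# minority samples, which avoids generating noisy points in majority regions.
# cluster_balance_threshold may need adjusting for a given dataset.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)

clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
```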

    A novel generative adversarial networks modelling for the class imbalance problem in high dimensional omics data

    Get PDF
    Class imbalance remains a major problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. More naive approaches can introduce additional biases into the data and are especially sensitive to inaccuracies in the training data, a concern given the characteristically noisy data obtained in healthcare, and all the more so for high-dimensional data. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional datasets, improving upon more naive generative approaches. The method was compared with the ‘synthetic minority over-sampling technique’ (SMOTE) and ‘random oversampling’ (RO). Generative methods were validated by training classifiers on the balanced data
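    As a rough sketch of GAN-based oversampling for tabular data, the example below trains a small vanilla GAN on minority-class rows only and samples synthetic rows from the generator. It is not the architecture proposed in the record above; the layer sizes, learning rates, and epoch count are illustrative assumptions.

```python
# Minimal sketch of GAN-based minority oversampling for tabular data.
# A vanilla GAN trained only on minority-class rows; not the paper's
# architecture. Layer sizes and epochs are illustrative.
import torch
import torch.nn as nn

def make_gan(n_features, latent_dim=32, hidden=128):
    gen = nn.Sequential(
        nn.Linear(latent_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, n_features),
    )
    disc = nn.Sequential(
        nn.Linear(n_features, hidden), nn.LeakyReLU(0.2),
        nn.Linear(hidden, 1),          # raw logit, paired with BCEWithLogitsLoss
    )
    return gen, disc

def oversample_with_gan(X_minority, n_new, epochs=500, latent_dim=32):
    """Train a small GAN on the minority-class matrix (n_samples, n_features)
    and return n_new synthetic rows as a NumPy array."""
    X = torch.as_tensor(X_minority, dtype=torch.float32)
    gen, disc = make_gan(X.shape[1], latent_dim)
    loss_fn = nn.BCEWithLogitsLoss()
    g_opt = torch.optim.Adam(gen.parameters(), lr=1e-3)
    d_opt = torch.optim.Adam(disc.parameters(), lr=1e-3)

    for _ in range(epochs):
        # Discriminator step: real minority rows vs. generated rows.
        z = torch.randn(X.shape[0], latent_dim)
        fake = gen(z).detach()
        d_loss = loss_fn(disc(X), torch.ones(X.shape[0], 1)) + \
                 loss_fn(disc(fake), torch.zeros(X.shape[0], 1))
        d_opt.zero_grad(); d_loss.backward(); d_opt.step()

        # Generator step: try to fool the discriminator.
        z = torch.randn(X.shape[0], latent_dim)
        g_loss = loss_fn(disc(gen(z)), torch.ones(X.shape[0], 1))
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

    with torch.no_grad():
        return gen(torch.randn(n_new, latent_dim)).numpy()
```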

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues

    Privacy-Preserving Generalized Linear Models using Distributed Block Coordinate Descent

    Get PDF
    Combining data from varied sources has considerable potential for knowledge discovery: collaborating data parties can mine data in an expanded feature space, allowing them to explore a larger range of scientific questions. However, data sharing among different parties is highly restricted by legal conditions, ethical concerns, and/or data volume. Fueled by these concerns, the fields of cryptography and distributed learning have made great progress towards privacy-preserving and distributed data mining. However, practical implementations have been hampered by the limited scope or computational complexity of these methods. In this paper, we greatly extend the range of analyses available for vertically partitioned data, i.e., data collected by separate parties with different features on the same subjects. To this end, we present a novel approach for privacy-preserving generalized linear models, a fundamental and powerful framework underlying many prediction and classification procedures. We base our method on a distributed block coordinate descent algorithm to obtain parameter estimates, and we develop an extension to compute accurate standard errors without additional communication cost. We critically evaluate the information transfer for semi-honest collaborators and show that our protocol is secure against data reconstruction. Through both simulated and real-world examples we illustrate the functionality of our proposed algorithm. Without leaking information, our method performs as well on vertically partitioned data as existing methods on combined data -- all within mere minutes of computation time. We conclude that our method is a viable approach for vertically partitioned data analysis with a wide range of real-world applications. Comment: Fully reproducible code for all results and images can be found at https://github.com/vankesteren/privacy-preserving-glm, and the software package can be found at https://github.com/vankesteren/privre
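    The communication pattern behind block coordinate descent on vertically partitioned data can be sketched in a few lines: each party holds its own feature block, updates only its own coefficients, and shares nothing but its partial linear predictor. The plain NumPy sketch below illustrates that pattern for logistic regression only; it omits the paper's privacy-preserving machinery (secure evaluation, standard-error computation), and the function names, step size, and iteration count are illustrative assumptions.

```python
# Plain (non-secure) sketch of block coordinate descent for logistic
# regression on vertically partitioned data. Each party k holds feature
# block X_k and coefficient block beta_k, and only the n-vector X_k @ beta_k
# is exchanged. This is not the paper's privacy-preserving protocol.
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def block_cd_logistic(blocks, y, n_outer=50, lr=0.1):
    """blocks: list of (n_samples, p_k) arrays, one per party.
    Returns one coefficient vector per party."""
    betas = [np.zeros(Xk.shape[1]) for Xk in blocks]
    # Shared quantities: each party's contribution to the linear predictor.
    partial = [Xk @ bk for Xk, bk in zip(blocks, betas)]

    for _ in range(n_outer):
        for k, Xk in enumerate(blocks):
            eta = sum(partial)                      # combined linear predictor
            grad = Xk.T @ (sigmoid(eta) - y) / len(y)
            betas[k] -= lr * grad                   # update this party's block only
            partial[k] = Xk @ betas[k]              # refresh shared contribution
    return betas

# Usage (illustrative): two parties each holding a subset of the columns.
# X_a, X_b = X[:, :5], X[:, 5:]
# beta_a, beta_b = block_cd_logistic([X_a, X_b], y)
```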

    Oversampling for imbalanced learning based on k-means and SMOTE

    Get PDF
    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    Get PDF
    In recent decades, new technologies have enabled remarkable progress in the understanding of biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-throughput sequencing have brought new opportunities and challenges to the fields of computational biology and bioinformatics. These sequencing techniques produce large amounts of data whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out the analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data and, in particular, the modeling of biological processes. Here we present eleven works selected for publication in this Special Issue for their interest, quality, and originality

    Applying machine learning algorithms to medical knowledge

    Get PDF
    Integrated Master's dissertation in Informatics Engineering. Achieving great and undeniable success in a wide variety of industries and businesses has made the term Big Data very popular in the scientific community. Big Data (BD) refers to a fast-growing research area in Computer Science (CS) that comprises many lines of work across the world. The healthcare sector is widely known to be highly prolific in the production of large quantities of data, ranging from health information, such as a patient's blood pressure and cholesterol levels, to more private and sensitive data, such as the history of medical procedures or reports of ongoing diseases. The application of sophisticated techniques enables a profound and rigorous analysis of data, something a human cannot do in real time. A machine, however, is capable of rapidly collecting, grouping, storing, and examining vast amounts of data and extracting unknown and possibly interesting knowledge from it. The algorithms used can discover hidden relationships between attributes that prove very useful to an organisation's work, and buried structures within the produced data can also be detected by these techniques. Machine Learning (ML) methods can be adjusted and modelled for different input representations; this adaptability is one of the factors that contributes to their success. The main goal is to make predictions on data by building efficient models that can accurately take in the data and predict a certain outcome. This is especially important to the healthcare industry, since it can considerably improve the lives of many patients. Everything from detecting a type of disease and predicting the chance of morbidity after a hospital stay to aiding decision making about treatment strategies is vital to patients as well as to clinicians. Any improvement over established methods that have been previously studied, tested, and published is an asset that will improve patients' satisfaction with healthcare performance in medical institutions. This can be achieved by refining those algorithms or implementing new approaches that make better predictions on the given data. The main objective of this dissertation is to propose ML approaches, having acknowledged and evaluated the existing methods used on clinical data. To fulfil this goal, an analysis of the state of the art of medical knowledge repositories and of scientific papers related to the selected keywords was performed. In this line of work, it is crucial to understand, compare, and discuss the results obtained against those previously published. Thus, one of the goals is to suggest new ways of solving those problems and measuring them against the existing ones.

    Painting a Picture of the Ovarian Cancer N-Glycome

    Get PDF
    Our story begins with the current clinical strategies used today to (1) screen for and detect ovarian cancer in its early stages, (2) monitor treatment effectiveness, (3) detect ovarian cancer recurrence, and (4) stratify ovarian cancer patients. However, these current strategies rely on FDA-approved ovarian cancer biomarkers, such as CA125 and HE4, which are relatively unsuccessful and lack specificity, especially for early-stage patients. In the introduction, it is highlighted that these ovarian cancer biomarkers and other disease biomarkers are typically glycoproteins, but their glycan structure-protein function relationship remains unknown. Protein glycosylation is one of the most complex post-translational modifications (PTMs) found in humans, with N-linked glycans playing a significant role in protein folding and conformation, protein stability and activity, cell-cell interaction, and cell signalling pathways. The best approach to study and analyse N-glycans so far has been to structurally characterise them by first releasing them from complex glycoprotein mixtures using PNGase F, and then identifying specific structures using analytical strategies that may potentially translate into clinical strategies. Ultimately, this thesis focuses on analytical techniques, primarily mass spectrometry, that are available to qualitatively and quantitatively assess N-glycosylation while successfully characterising compositional, structural, and linkage features with high specificity and sensitivity. The analytical techniques explored include liquid chromatography electrospray ionisation tandem mass spectrometry (LC-ESI-MS/MS) and matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF-MS). These analytical techniques have previously been implemented in the clinic for other diseases, but not yet for ovarian cancer. It may be possible to implement either LC-ESI-MS/MS or MALDI-TOF-MS in the clinic for ovarian cancer using N-glycomics-based approaches, since aberrant N-glycosylation patterns have been observed consistently across clinical samples, such as serum, plasma, ascites, and tissue. MALDI mass spectrometry imaging (MSI) has emerged as a platform to visualise N-glycans in tissue-specific regions. As outlined in this thesis, our group studied the intrapatient and interpatient variability between early- and late-stage ovarian cancer patients. From our studies, specific N-glycan differences were identified between the early- and late-stage tumour microenvironment that could lead to the development of ovarian cancer diagnosis and prognosis strategies for the clinic. Thesis (Ph.D.) -- University of Adelaide, School of Biological Sciences, 201