89 research outputs found
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class, but misclassify the minority class. However, in real-life applications the minority class instances often represent the concept of greater interest. Recently, several techniques based on sampling methods (under-sampling of the majority class and over-sampling of the minority class), cost-sensitive learning, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with boosting (AdaBoost), called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of CUSBoost against state-of-the-art ensemble methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
Comment: CSITSS-201
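As the abstract describes, CUSBoost replaces random under-sampling with cluster-based under-sampling before boosting. Below is a minimal sketch of the idea; the function name is made up, and under-sampling is done once up front rather than inside every boosting round as boosting hybrids typically do, so this is an illustration of the concept, not the authors' implementation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def cluster_undersample(X, y, majority_label, n_clusters=3, seed=1):
    """Under-sample the majority class by k-means clustering it and
    drawing an equal share from every cluster, so the reduced set
    still covers all regions of the majority class."""
    rng = np.random.default_rng(seed)
    maj = np.where(y == majority_label)[0]
    mino = np.where(y != majority_label)[0]
    per_cluster = max(1, len(mino) // n_clusters)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X[maj])
    keep = []
    for c in range(n_clusters):
        members = maj[labels == c]
        take = min(per_cluster, len(members))
        keep.extend(rng.choice(members, size=take, replace=False))
    idx = np.concatenate([np.array(keep, dtype=int), mino])
    return X[idx], y[idx]

# Toy imbalanced data: 200 majority points, 20 minority points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
Xb, yb = cluster_undersample(X, y, majority_label=0)
clf = AdaBoostClassifier(n_estimators=50, random_state=0).fit(Xb, yb)
```

Drawing from every cluster, rather than uniformly at random as RUSBoost does, is what keeps informative majority examples from sparse regions in the reduced training set.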
Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Learning from class-imbalanced data continues to be a common and challenging
problem in supervised learning as standard classification algorithms are
designed to handle balanced class distributions. While different strategies
exist to tackle this problem, methods which generate artificial data to achieve
a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the
training data, allowing any classifier to be used with class-imbalanced
datasets. Many algorithms have been proposed for this task, but most are
complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE
oversampling, which avoids the generation of noise and effectively overcomes
imbalances between and within classes. Empirical results of extensive
experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE
consistently outperforms other popular oversampling methods. An implementation
is made available in the Python programming language.
Comment: 19 pages, 8 figure
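The combination of k-means and SMOTE can be sketched as follows: cluster the full dataset, treat minority-dominated clusters as safe regions, and interpolate synthetic points only inside them so that no synthetic sample bridges a majority region. This is a simplified illustration, not the released implementation; the function name, the 0.5 dominance threshold, and the per-draw neighbour search are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def kmeans_smote(X, y, minority_label, n_clusters=5, k=3, n_new=None, seed=0):
    """Oversample the minority class with SMOTE interpolation restricted
    to clusters dominated by the minority class (the 'safe' areas)."""
    rng = np.random.default_rng(seed)
    if n_new is None:  # balance the two classes by default
        n_new = int((y != minority_label).sum() - (y == minority_label).sum())
    clusters = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    safe = [c for c in range(n_clusters)
            if (y[clusters == c] == minority_label).mean() > 0.5]
    if not safe:  # fallback: no cluster is minority-dominated
        safe = list(range(n_clusters))
    synth = []
    for _ in range(n_new):
        c = safe[rng.integers(len(safe))]
        pts = X[(clusters == c) & (y == minority_label)]
        if len(pts) < 2:  # nothing to interpolate between
            continue
        nn = NearestNeighbors(n_neighbors=min(k + 1, len(pts))).fit(pts)
        i = int(rng.integers(len(pts)))
        neigh = nn.kneighbors(pts[i:i + 1], return_distance=False)[0][1:]
        j = neigh[rng.integers(len(neigh))]
        synth.append(pts[i] + rng.random() * (pts[j] - pts[i]))
    if synth:
        X = np.vstack([X, np.array(synth)])
        y = np.concatenate([y, np.full(len(synth), minority_label)])
    return X, y

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([0] * 200 + [1] * 20)
Xs, ys = kmeans_smote(X, y, minority_label=1, n_clusters=2)
```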
A novel generative adversarial networks modelling for the class imbalance problem in high dimensional omics data
Class imbalance remains a major problem in high-throughput omics analyses, causing bias towards the over-represented class when training machine learning-based classifiers. Oversampling is a common method used to balance classes, allowing for better generalization of the training data. Naive approaches, however, can introduce additional biases and are especially sensitive to inaccuracies in the training data, a concern given the characteristically noisy, high-dimensional data obtained in healthcare. A generative adversarial network-based method is proposed for creating synthetic samples from small, high-dimensional datasets, to improve upon more naive generative approaches. The method was compared with 'synthetic minority over-sampling technique' (SMOTE) and 'random oversampling' (RO). Generative methods were validated by training classifiers on the balanced data
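To make the setup concrete, here is a toy NumPy GAN for oversampling a small, high-dimensional minority class. The linear generator, logistic discriminator, dimensions, and learning rate are all illustrative assumptions; the paper's actual architecture is not described in this abstract:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

# Real minority-class samples: few (n=30) but "high-dimensional" (d=50).
d, n_real, z_dim = 50, 30, 8
real = rng.normal(1.0, 0.5, size=(n_real, d))

# Linear generator G(z) = zA + c; logistic discriminator D(x) = sigmoid(xw + b).
A = rng.normal(0, 0.1, size=(z_dim, d)); c = np.zeros(d)
w = rng.normal(0, 0.1, size=d); b = 0.0
lr = 0.05

for step in range(500):
    z = rng.normal(size=(n_real, z_dim))
    fake = z @ A + c
    # Discriminator ascent on log D(real) + log(1 - D(fake)).
    d_real, d_fake = sigmoid(real @ w + b), sigmoid(fake @ w + b)
    w += lr * (real.T @ (1 - d_real) - fake.T @ d_fake) / n_real
    b += lr * ((1 - d_real).mean() - d_fake.mean())
    # Generator ascent on log D(fake) (non-saturating GAN loss).
    d_fake = sigmoid(fake @ w + b)
    g_x = np.outer(1 - d_fake, w)        # d log D / d x, one row per fake
    A += lr * (z.T @ g_x) / n_real
    c += lr * g_x.mean(axis=0)

# Draw as many synthetic minority samples as the classifier needs.
synthetic = rng.normal(size=(100, z_dim)) @ A + c
```

The adversarial loop is the point: unlike SMOTE's local interpolation, the generator is pushed towards the whole minority distribution, which is the claimed advantage on noisy, high-dimensional omics data.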
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbating those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
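As a concrete (and deliberately naive) illustration of two of the listed challenges, the sketch below performs early integration of two hypothetical omics blocks, handling missing data by mean imputation and data heterogeneity by per-block scaling; none of this is taken from the review itself:

```python
import numpy as np

def integrate_blocks(blocks):
    """Naive early integration of multi-omics blocks: impute missing
    values with per-feature means, z-score each block so no single
    modality dominates by scale, then concatenate column-wise."""
    out = []
    for X in blocks:
        X = X.astype(float).copy()
        col_mean = np.nanmean(X, axis=0)
        idx = np.where(np.isnan(X))
        X[idx] = np.take(col_mean, idx[1])   # mean imputation of missing cells
        mu, sd = X.mean(axis=0), X.std(axis=0) + 1e-9
        out.append((X - mu) / sd)            # per-block standardization
    return np.hstack(out)

rng = np.random.default_rng(0)
expr = rng.normal(0, 5, (40, 100))           # e.g. transcriptome (wide range)
meth = rng.random((40, 60))                  # e.g. methylation betas in [0, 1]
expr[rng.random(expr.shape) < 0.1] = np.nan  # 10% missing values
merged = integrate_blocks([expr, meth])      # one matrix, 40 samples x 160 features
```

The review's point is precisely that such naive concatenation is rarely enough; dedicated methods handle each of the five challenges far more carefully.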
Privacy-Preserving Generalized Linear Models using Distributed Block Coordinate Descent
Combining data from varied sources has considerable potential for knowledge
discovery: collaborating data parties can mine data in an expanded feature
space, allowing them to explore a larger range of scientific questions.
However, data sharing among different parties is highly restricted by legal
conditions, ethical concerns, and/or data volume. Fueled by these concerns,
the fields of cryptography and distributed learning have made great progress
towards privacy-preserving and distributed data mining. However, practical
implementations have been hampered by the limited scope or computational
complexity of these methods. In this paper, we greatly extend the range of
analyses available for vertically partitioned data, i.e., data collected by
separate parties with different features on the same subjects. To this end, we
present a novel approach for privacy-preserving generalized linear models, a
fundamental and powerful framework underlying many prediction and
classification procedures. We base our method on a distributed block coordinate
descent algorithm to obtain parameter estimates, and we develop an extension to
compute accurate standard errors without additional communication cost. We
critically evaluate the information transfer for semi-honest collaborators and
show that our protocol is secure against data reconstruction. Through both
simulated and real-world examples we illustrate the functionality of our
proposed algorithm. Without leaking information, our method performs as well on
vertically partitioned data as existing methods on combined data -- all within
mere minutes of computation time. We conclude that our method is a viable
approach for vertically partitioned data analysis with a wide range of
real-world applications.
Comment: Fully reproducible code for all results and images can be found at
https://github.com/vankesteren/privacy-preserving-glm, and the software
package can be found at https://github.com/vankesteren/privre
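The core protocol idea, each party updating only its own coefficient block while exchanging nothing but per-subject linear-predictor contributions, can be sketched for a logistic GLM as follows. The function name, plain gradient steps, and learning rate are illustrative; the paper's algorithm, standard-error extension, and security analysis are considerably more involved:

```python
import numpy as np

sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

def distributed_bcd_logistic(X_parts, y, n_iter=200, lr=0.1):
    """Block coordinate descent for a vertically partitioned logistic GLM:
    each party holds one block of features and its own coefficient block,
    and only the per-row contributions eta_k = X_k @ beta_k are exchanged,
    never the raw feature columns."""
    betas = [np.zeros(X.shape[1]) for X in X_parts]
    etas = [X @ b for X, b in zip(X_parts, betas)]
    for _ in range(n_iter):
        for k, X in enumerate(X_parts):
            resid = y - sigmoid(sum(etas))      # GLM working residual
            betas[k] += lr * X.T @ resid / len(y)
            etas[k] = X @ betas[k]              # refresh this party's share
    return betas

# Two collaborating parties, each with two features on the same 200 subjects.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
beta_true = np.array([1.0, -1.0, 2.0, 0.5])
y = (sigmoid(X @ beta_true) > rng.random(200)).astype(float)
parts = [X[:, :2], X[:, 2:]]
betas = distributed_bcd_logistic(parts, y)
eta = sum(Xk @ bk for Xk, bk in zip(parts, betas))
acc = ((eta > 0).astype(float) == y).mean()
```

Note that only the length-n vectors `etas[k]` cross party boundaries, which is why the communication cost stays low even when each block has many features.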
Oversampling for imbalanced learning based on k-means and smote
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
Learning from class-imbalanced data continues to be a common and challenging problem in
supervised learning as standard classification algorithms are designed to handle balanced class
distributions. While different strategies exist to tackle this problem, methods which generate
artificial data to achieve a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any
classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this
task, but most are complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids
the generation of noise and effectively overcomes imbalances between and within classes. Empirical
results of extensive experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE consistently
outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
Computational Methods for the Analysis of Genomic Data and Biological Processes
In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality
Applying machine learning algorithms to medical knowledge
Master's dissertation in Informatics Engineering (Dissertação de mestrado integrado em Engenharia Informática).
Achieving great and undeniable success in a wide variety of industries and businesses has made the term Big Data very popular in the scientific community. Big Data (BD) refers to a fast-growing research area in Computer Science (CS) that comprises many fields of work across the world. The healthcare sector is widely known to be highly prolific in producing large quantities of data, ranging from health information, such as a patient's blood pressure and cholesterol levels, to more private and sensitive data, such as the history of medical procedures or reports of ongoing diseases.
The application of sophisticated techniques enables a profound and rigorous analysis of data, something a human cannot do in real time. A machine, however, is capable of rapidly collecting, grouping, storing and examining vast amounts of data and extracting previously unknown and possibly interesting knowledge from it. The algorithms used can discover hidden relationships between attributes that prove very useful to an organisation's work, and can also detect buried structures within the produced data. Machine Learning (ML) methods can be adjusted and modelled for different input representations; this adaptability is one of the factors behind their growing adoption.
The main goal is to make predictions on data by building highly efficient models that can accurately take in the data and predict a certain outcome. This is especially important to the healthcare industry, since it can considerably improve the lives of many patients. Everything from detecting a type of disease, to predicting the chance of morbidity after a hospital stay, to aiding decision-making on treatment strategies is vital to patients as well as to clinicians.
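As a toy illustration of the kind of prediction task described above, the sketch below fits a logistic-regression model to entirely synthetic "clinical" features; the feature set, coefficients, and label rule are invented for the example and carry no clinical meaning:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Entirely synthetic "clinical" features for 500 hypothetical patients.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([
    rng.normal(130, 20, n),   # systolic blood pressure (mmHg)
    rng.normal(200, 40, n),   # total cholesterol (mg/dL)
    rng.normal(55, 15, n),    # age (years)
])
# Invented label rule: "morbidity" risk grows with all three features.
risk = 0.02 * (X[:, 0] - 130) + 0.01 * (X[:, 1] - 200) + 0.04 * (X[:, 2] - 55)
y = (risk + rng.normal(0, 1, n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
acc = model.score(X_te, y_te)  # held-out accuracy of the fitted model
```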
Any improvement over established methods that have been previously studied, tested and published is an asset that will improve patients' satisfaction with the performance of medical institutions. This can be achieved by refining those algorithms or by implementing new approaches that make better predictions on the given data.
The main objective of this dissertation is to propose ML approaches, having acknowledged and evaluated the existing methods used on clinical data. To fulfil this goal, an analysis was performed of the state of the art of medical knowledge repositories and of scientific papers related to the selected keywords. In this line of work, it is crucial to understand, compare and discuss the results obtained against those previously published. Thus, one of the goals is to suggest new ways of solving these problems and to measure them against the existing ones.
Painting a Picture of the Ovarian Cancer N-Glycome
Our story begins with the current clinical strategies that are used by clinicians today to (1) screen and detect ovarian cancer in the early stages, (2) monitor treatment effectiveness, (3) detect ovarian cancer recurrence and (4) stratify ovarian cancer patients. However, these current strategies utilise FDA-approved ovarian cancer biomarkers, such as CA125 and HE4, which are relatively unsuccessful and lack specificity, especially for early-stage patients. In the introduction, it is highlighted that these ovarian cancer biomarkers and other disease biomarkers are typically glycoproteins, but their glycan structure-protein function relationship remains unknown. Protein glycosylation is one of the most complex post-translational modifications (PTMs) found in humans, with N-linked glycans playing a significant role in protein folding and conformation, protein stability and activity, cell-cell interaction, and cell signalling pathways. The best approach to study and analyse N-glycans so far has been to structurally characterise them by, first, releasing them from complex glycoprotein mixtures using PNGase F and, second, identifying specific structures using analytical strategies that may potentially translate into clinical strategies. Ultimately, this thesis focuses on analytical techniques, primarily mass spectrometry, that are available to qualitatively and quantitatively assess N-glycosylation while successfully characterising compositional, structural and linkage features with high specificity and sensitivity. Analytical techniques that were explored include liquid chromatography electrospray ionisation tandem mass spectrometry (LC-ESI-MS/MS) and matrix-assisted laser desorption/ionisation time-of-flight mass spectrometry (MALDI-TOF-MS). These analytical techniques have previously been implemented in the clinic for other diseases, but not yet for ovarian cancer.
It may be possible to implement either LC-ESI-MS/MS or MALDI-TOF-MS in the clinic for ovarian cancer using N-glycomic-based approaches since aberrant N-glycosylation patterns have been observed consistently between clinical samples, such as serum, plasma, ascites and tissue. MALDI mass spectrometry imaging (MSI) has emerged as a platform to visualise N-glycans in tissue-specific regions. Outlined in this thesis, our group studied the intrapatient and interpatient variability between early- and late-stage ovarian cancer patients. From our studies, specific N-glycan differences were identified between the early- and late-stage tumour microenvironment that could lead to the development of ovarian cancer diagnosis and prognosis strategies for the clinic.
Thesis (Ph.D.) -- University of Adelaide, School of Biological Sciences, 201
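Since released N-glycans are typically identified in MS workflows by matching observed m/z values against theoretical compositions, that calculation can be sketched as below. The function is hypothetical, but the residue masses are the standard monoisotopic values used in glycomics:

```python
# Monoisotopic residue masses (Da) for common N-glycan building blocks.
RESIDUE = {
    "Hex": 162.05282,     # hexose (Man/Gal/Glc)
    "HexNAc": 203.07937,  # N-acetylhexosamine (GlcNAc/GalNAc)
    "dHex": 146.05791,    # deoxyhexose (fucose)
    "NeuAc": 291.09542,   # N-acetylneuraminic (sialic) acid
}
WATER, SODIUM = 18.01056, 22.98922

def glycan_mass(composition):
    """Neutral monoisotopic mass of a released N-glycan from its
    monosaccharide composition (sum of residue masses plus one water)."""
    return sum(RESIDUE[m] * n for m, n in composition.items()) + WATER

# Man5 (Hex5HexNAc2), the smallest high-mannose N-glycan:
man5 = glycan_mass({"Hex": 5, "HexNAc": 2})
mz_na = man5 + SODIUM   # the [M+Na]+ adduct commonly seen in MALDI-TOF
```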