37 research outputs found

    Expert cancer model using supervised algorithms with a LASSO selection approach

    Breast cancer is one of the most critical causes of mortality in the medical field today. Each year, a large number of men and women face cancer-related deaths due to the lack of early diagnosis systems and proper treatment. To tackle this issue, various data mining approaches have been analyzed to build an effective model that helps identify the different stages of deadly cancers. This study proposes an early cancer disease model based on five supervised algorithms: logistic regression (LR), decision tree (DT), random forest (RF), support vector machine (SVM), and k-nearest neighbor (KNN). After appropriate preprocessing of the dataset, the least absolute shrinkage and selection operator (LASSO) was used for feature selection (FS) with a 10-fold cross-validation (CV) approach; employing LASSO with 10-fold cross-validation is a novel step introduced in this research. Afterwards, different performance evaluation metrics were measured to assess the predictions of the proposed algorithms. The results indicated that the highest accuracy, approximately 99.41%, was achieved by the RF classifier with the integration of LASSO. Finally, a comprehensive comparison was carried out on the Wisconsin breast cancer (diagnostic) dataset (WBCD) against some current works that use all features.
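    As a minimal sketch of the pipeline described above, assuming scikit-learn and its bundled copy of the WBCD data; the preprocessing, threshold, and hyperparameters below are illustrative placeholders, not the study's exact configuration:

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # WBCD as shipped with scikit-learn (569 samples, 30 features).
    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y)

    # Standardize so the LASSO penalty treats all features on a comparable scale.
    scaler = StandardScaler().fit(X_tr)
    X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

    # LASSO with 10-fold CV to choose the shrinkage strength; features whose
    # coefficients are driven (near) to zero are discarded.
    lasso = LassoCV(cv=10, random_state=42).fit(X_tr_s, y_tr)
    selector = SelectFromModel(lasso, prefit=True, threshold=1e-5)
    X_tr_fs, X_te_fs = selector.transform(X_tr_s), selector.transform(X_te_s)

    rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr_fs, y_tr)
    print(f"kept {X_tr_fs.shape[1]}/{X.shape[1]} features, "
          f"RF accuracy: {accuracy_score(y_te, rf.predict(X_te_fs)):.4f}")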

    Improving k-nn search and subspace clustering based on local intrinsic dimensionality

    In several novel applications such as multimedia and recommender systems, data is often represented as object feature vectors in high-dimensional spaces. High-dimensional data is always a challenge for state-of-the-art algorithms because of the so-called curse of dimensionality. As the dimensionality increases, the discriminative ability of similarity measures diminishes to the point where many data analysis algorithms that depend on them, such as similarity search and clustering, lose their effectiveness. One way to handle this challenge is by selecting the most important features, which is essential for providing compact object representations as well as improving the overall search and clustering performance. Having compact feature vectors can further reduce the storage space and the computational complexity of search and learning tasks. Support-Weighted Intrinsic Dimensionality (support-weighted ID) is a promising new feature selection criterion that estimates the contribution of each feature to the overall intrinsic dimensionality. Support-weighted ID identifies relevant features locally for each object, and penalizes those features that have locally lower discriminative power as well as higher density. In effect, support-weighted ID measures the ability of each feature to locally discriminate between objects in the dataset. Based on support-weighted ID, this dissertation introduces three main research contributions. First, it proposes NNWID-Descent, a similarity graph construction method that utilizes the support-weighted ID criterion to identify and retain relevant features locally for each object and enhance the overall graph quality. Second, with the aim of improving the accuracy and performance of cluster analysis, it introduces k-LIDoids, a subspace clustering algorithm that extends the utility of support-weighted ID within a clustering framework in order to gradually select the subset of informative and important features per cluster; k-LIDoids constructs clusters while also finding a low-dimensional subspace for each cluster. Finally, using the compact object and cluster representations from NNWID-Descent and k-LIDoids, this dissertation defines LID-Fingerprint, a new binary fingerprinting and multi-level indexing framework for high-dimensional data. LID-Fingerprint can be used for hiding information as a way of thwarting passive adversaries, as well as for providing efficient and secure similarity search and retrieval for data stored on the cloud. When compared to other state-of-the-art algorithms, the good practical performance of the proposed algorithms provides evidence of their effectiveness for data in high-dimensional spaces.
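    The support-weighted ID criterion builds on local intrinsic dimensionality (LID). As a rough illustration, here is a minimal sketch of the standard maximum-likelihood (Hill-type) LID estimate from k-nearest-neighbor distances, assuming NumPy and scikit-learn; the support weighting and the three proposed algorithms themselves are not reproduced here:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def lid_mle(knn_dists):
        """Hill-type MLE of local intrinsic dimensionality at one query point,
        given its sorted distances to the k nearest neighbors (all > 0)."""
        r_k = knn_dists[-1]  # distance to the k-th (farthest) neighbor
        return -1.0 / np.mean(np.log(knn_dists / r_k))

    # Toy high-dimensional data: 1000 points in a 50-dimensional space.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 50))

    k = 20
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    lids = np.array([lid_mle(d[1:]) for d in dists])  # drop the zero self-distance
    print(f"mean local ID estimate: {lids.mean():.2f}")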

    ROBUST DETECTION OF CORONARY HEART DISEASE USING MACHINE LEARNING ALGORITHMS

    Predicting whether or not someone will develop heart (cardiac) disease is now one of the most difficult tasks in the field of medicine. Heart disease is responsible for the death of about one person per minute in the contemporary age. Processing the vast amounts of data generated in healthcare is an important application for data science. Because predicting cardiac disease is a difficult undertaking, there is a pressing need to automate the prediction process to minimize the associated dangers and give the patient a timely warning. Chapter one of this thesis highlights the importance of this problem and identifies the need to augment current technological efforts to produce a more accurate system that facilitates timely decisions about the problem; it also reviews the current literature on the theories and systems developed and assessed in this direction. This thesis work makes use of the heart disease dataset available in the UCI machine learning repository. Using a variety of data mining strategies, such as Naive Bayes, Decision Tree, Support Vector Machine (SVM), K-Nearest Neighbor (K-NN), and Random Forest, the work reported in this thesis estimates the likelihood that a patient will develop heart disease and can categorize the patient's degree of risk. The performance of the chosen classifiers is tested on a feature space selected with the help of a feature selection algorithm. The models were trained and tested on the Cleveland heart disease dataset. To assess the usefulness and strength of each model, several performance metrics are utilized, including sensitivity, accuracy, AUC, specificity, ROC curve, and F1-score. This research conducts a comparative analysis of the performance of several machine learning algorithms. The results of the experiment demonstrate that the Random Forest and Support Vector Machine algorithms achieved the best accuracy (94.50% and 91.73%, respectively) on the selected feature space when compared to the other machine learning methods employed; these two classifiers thus turned out to be promising for heart disease prediction. The computational complexity of each classifier was also investigated. Based on the computational complexity and the comparative experimental results, a robust heart disease prediction system is proposed for an embedded platform, in which the benefits of multiple classifiers are accumulated: the system reports a detection with high confidence only when many of the classifiers detect it. Finally, the experimental results are summarized and possible future strategies for enhancing this effort are discussed.
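    A minimal sketch of this accumulated-confidence idea, using scikit-learn's majority-vote ensemble over the five classifier families named above; the synthetic stand-in data and all hyperparameters are assumptions, not the thesis's configuration:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, VotingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    members = [
        ("nb", GaussianNB()),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("svm", make_pipeline(StandardScaler(), SVC(random_state=0))),
        ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
        ("rf", RandomForestClassifier(random_state=0)),
    ]
    # 'hard' voting flags disease only when a majority of the members do,
    # mirroring the idea that a detection requires agreement among classifiers.
    ensemble = VotingClassifier(estimators=members, voting="hard")

    # Synthetic stand-in for the 13-feature Cleveland data.
    X, y = make_classification(n_samples=300, n_features=13, random_state=0)
    print("10-fold CV accuracy:", cross_val_score(ensemble, X, y, cv=10).mean())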

    A Machine Learning Approach to Indoor Localization Data Mining

    Indoor positioning systems are increasingly commonplace in various environments and produce large quantities of data. They are used in industrial applications, robotics, and asset and employee tracking, to name a few use cases. The growing amount of data and the accelerating progress of machine learning open up many new possibilities for analyzing this data in ways that were not conceivable or relevant before. This thesis introduces connected concepts and implementations to answer the question of how this data can be utilized. The data gathered in this thesis originates from an indoor positioning system deployed in a retail environment, but the discussed methods can be applied generally. The issue is approached by first introducing the concepts of machine learning and, more generally, artificial intelligence, and how they work at a general level. A deeper dive is taken into the subfields and algorithms relevant to the data mining task at hand. Indoor positioning system basics are also briefly discussed to establish a base understanding of the realistic capabilities and constraints that these kinds of systems entail. These methods and prior knowledge from the literature are put to the test with the freshly gathered data. An algorithm based on an existing example from the literature was tested and improved upon with the new data. A novel method to cluster and classify movement patterns is introduced, utilizing deep learning to create embedded representations of the trajectories in a more complex learning pipeline; this type of learning is often referred to as deep clustering. The results are promising, and both methods produce useful high-level representations of the complex dataset that can help a human operator discern the relevant patterns in raw data and that can serve as input for subsequent supervised and unsupervised learning steps. Several factors related to optimizing the learning pipeline, such as regularization, were also researched, and the results are presented as visualizations. The research found that a pipeline consisting of a CNN autoencoder followed by a classic clustering algorithm such as DBSCAN produces useful results in the form of trajectory clusters, and that regularization such as an L1 penalty improves this performance. The research done in this thesis presents useful algorithms for processing raw, noisy localization data from indoor environments that can be used for further implementations in both industrial applications and academia.
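    A minimal sketch of such a deep-clustering pipeline, assuming PyTorch and scikit-learn: a 1-D convolutional autoencoder embeds fixed-length (x, y) trajectories and DBSCAN clusters the embeddings. The architecture sizes, the L1 weight, and the random stand-in trajectories are all assumptions:

    import torch
    import torch.nn as nn
    from sklearn.cluster import DBSCAN

    T = 64  # trajectory length, assumed fixed after resampling

    class TrajectoryAE(nn.Module):
        def __init__(self, latent_dim=16):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv1d(2, 16, 5, stride=2, padding=2), nn.ReLU(),
                nn.Conv1d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
                nn.Flatten(),
                nn.Linear(32 * (T // 4), latent_dim),
            )
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 32 * (T // 4)), nn.ReLU(),
                nn.Unflatten(1, (32, T // 4)),
                nn.ConvTranspose1d(32, 16, 5, stride=2, padding=2,
                                   output_padding=1), nn.ReLU(),
                nn.ConvTranspose1d(16, 2, 5, stride=2, padding=2,
                                   output_padding=1),
            )

        def forward(self, x):
            z = self.encoder(x)
            return self.decoder(z), z

    trajs = torch.randn(256, 2, T)  # stand-in for preprocessed trajectories
    model = TrajectoryAE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(50):  # short demo training loop
        recon, z = model(trajs)
        # Reconstruction loss plus an L1 penalty on the embedding, echoing
        # the regularization finding reported above.
        loss = nn.functional.mse_loss(recon, trajs) + 1e-4 * z.abs().mean()
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():
        _, z = model(trajs)
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(z.numpy())
    print("clusters found:", len(set(labels) - {-1}))  # -1 marks noise points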

    Latent Factor Analysis of High-Dimensional Brain Imaging Data

    Recent advances in neuroimaging, especially functional magnetic resonance imaging (fMRI), have made it an important tool for understanding the human brain. Human cognitive functions can be mapped onto the brain's functional organization through high-resolution fMRI scans. However, the high-dimensional data arising from the increasing number of scanning tasks and subjects poses a challenge to existing methods that were not optimized for high-dimensional imaging data. In this thesis, I develop advanced data-driven methods that utilize more of the available sources of information in order to reveal a more robust brain-behavior relationship. In the first chapter, I provide an overview of the current related research in fMRI and my contributions to the field. In the second chapter, I propose two extensions to the connectome-based predictive modeling (CPM) method that combine multiple connectomes when building predictive models. Both extensions generate higher prediction accuracy than using a single connectome or the average of multiple connectomes, suggesting the advantage of incorporating multiple sources of information in predictive modeling. In the third chapter, I improve CPM from the perspective of the target behavioral measure. I propose another two extensions for CPM that combine multiple available behavioral measures into a composite measure for CPM to predict. The derived composite measures are shown to be predicted more accurately than any single behavioral measure, suggesting a more robust brain-behavior relationship. In the fourth chapter, I propose a nonlinear dimensionality reduction framework to embed fMRI data from multiple tasks into a low-dimensional space. This framework helps reveal the common brain state across the available tasks while also helping to discover the differences among these tasks. The results also provide valuable insights into the varying prediction performance of connectomes derived from different tasks. In the fifth chapter, I propose another, hyperbolic geometry-based, brain graph edge embedding framework. The framework is based on Poincaré embedding and represents edges of the brain graph in a low-dimensional space more accurately than traditional Euclidean geometry-based embeddings. Utilizing the embedding, we are able to cluster edges of the brain graph into disjoint clusters. The edge clusters can then be used to define overlapping brain networks, and derived metrics such as the network overlapping number can be used to investigate the functional flexibility of each brain region. Overall, these works provide rich data-driven methods that help in understanding the brain-behavior relationship through predictive modeling and low-dimensional data representation.
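    The fifth chapter's framework rests on distances in the Poincaré ball. As a small illustration of that geometry (not of the thesis's embedding or edge-clustering pipeline itself), the geodesic distance that Poincaré embeddings optimize can be sketched as follows, assuming NumPy:

    import numpy as np

    def poincare_distance(u, v):
        """Geodesic distance between two points strictly inside the unit ball."""
        sq_diff = np.sum((u - v) ** 2)
        denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + 2.0 * sq_diff / denom)

    # Distances blow up near the boundary, which is what lets hyperbolic
    # space represent hierarchies and graph structure compactly.
    u, v = np.array([0.1, 0.2]), np.array([0.7, -0.6])
    print(f"Poincare distance: {poincare_distance(u, v):.3f}")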

    Applications

    Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine, for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling, for quality control during manufacturing processes; and in traffic and logistics, for smart cities and for mobile communications.

    ANALYSIS AND SIMULATION OF TANDEM MASS SPECTROMETRY DATA

    This dissertation focuses on improvements to data analysis in mass spectrometry-based proteomics, which is the study of an organism's full complement of proteins. One of the biggest surprises from the Human Genome Project was the relatively small number of genes (~20,000) encoded in our DNA. Since genes code for proteins, scientists expected more genes would be necessary to produce a diverse set of proteins to cover the many functions that support the complexity of life. Thus, there is intense interest in studying proteomics, including post-translational modifications (how proteins change after translation from their genes) and protein interactions (e.g. proteins binding together to form complex molecular machines), to fill the void in molecular diversity. The goal of mass spectrometry in proteomics is to determine the abundance and amino acid sequence of every protein in a biological sample. A mass spectrometer can determine mass/charge ratios and abundances for fragments of short peptides (which are subsequences of a protein); sequencing algorithms determine which peptides are most likely to have generated the fragmentation patterns observed in the mass spectrum, and protein identity is inferred from the peptides. My work improves the computational tools for mass spectrometry by removing limitations of present algorithms, simulating mass spectrometry instruments to facilitate algorithm development, and creating algorithms that approximate isotope distributions, deconvolve chimeric spectra, and predict protein-protein interactions. While most sequencing algorithms attempt to identify a single peptide per mass spectrum, multiple peptides are often fragmented together. Here, I present a method to deconvolve these chimeric mass spectra into their individual peptide components by examining the isotopic distributions of their fragments. First, I derived the equation to calculate the theoretical isotope distribution of a peptide fragment. Next, for cases where elemental compositions are not known, I developed methods to approximate the isotope distributions. Ultimately, I created a non-negative least squares model that deconvolved chimeric spectra and increased peptide-spectrum matches by 15-30%. To improve the operation of mass spectrometer instruments, I developed software that simulates liquid chromatography-mass spectrometry data and the subsequent execution of custom data acquisition algorithms. The software provides an opportunity for researchers to test, refine, and evaluate novel algorithms prior to implementation on a mass spectrometer. Finally, I created a logistic regression classifier for predicting protein-protein interactions defined by affinity purification and mass spectrometry (APMS). The classifier increased the area under the receiver operating characteristic curve by 16% compared to previous methods. Furthermore, I created a web application to facilitate APMS data scoring within the scientific community.
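    A minimal sketch of the non-negative least squares idea, assuming NumPy and SciPy: an observed chimeric isotope pattern is modeled as a non-negative mixture of candidate fragments' theoretical isotope distributions. The distributions below are toy vectors, not values derived from elemental compositions as in the dissertation:

    import numpy as np
    from scipy.optimize import nnls

    # Columns: theoretical isotope distributions of two candidate peptide
    # fragments (monoisotopic peak first, each column summing to 1).
    A = np.array([
        [0.60, 0.20],
        [0.28, 0.35],
        [0.09, 0.27],
        [0.03, 0.18],
    ])

    # Observed chimeric pattern: ~2 parts fragment 1 to 1 part fragment 2,
    # plus a little measurement noise.
    b = 2.0 * A[:, 0] + 1.0 * A[:, 1] + np.array([0.01, -0.01, 0.005, 0.0])

    coeffs, residual = nnls(A, b)
    print("estimated fragment abundances:", coeffs)  # roughly [2.0, 1.0]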