Integrated bio-search approaches with multi-objective algorithms for optimization and classification problem
Optimal feature selection is difficult but crucial, particularly for classification tasks. Traditional feature selection methods treat features independently and can produce collections of irrelevant features, which degrades classification accuracy. The goal of this paper is to leverage the potential of bio-inspired search algorithms, together with a wrapper approach, to optimize the multi-objective algorithms ENORA and NSGA-II so that they generate an optimal set of features. The main step is to combine ENORA and NSGA-II with suitable bio-inspired search algorithms in which multiple subset generation is implemented. The next step is to validate the optimal feature set through subset evaluation. Eight (8) comparison datasets of various sizes were deliberately selected for testing. Results show that combining the multi-objective algorithms ENORA and NSGA-II with the selected bio-inspired search algorithms is promising for achieving a better optimal solution (i.e. the best features with higher classification accuracy) on the selected datasets. This finding implies that bio-inspired wrapper/filter algorithms can boost the efficiency of ENORA and NSGA-II for the task of selecting and classifying features
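The selection principle that multi-objective optimizers such as ENORA and NSGA-II share — keeping only the non-dominated trade-offs between subset size and classification accuracy — can be sketched in a few lines. This is a minimal illustration of Pareto filtering, not the paper's implementation; the candidate (size, accuracy) pairs below are hypothetical.

```python
# Minimal sketch of Pareto (non-dominated) filtering over candidate feature
# subsets, the core idea behind multi-objective wrapper feature selection.
# Each candidate is (num_features, accuracy): minimize the first objective,
# maximize the second. All numbers are hypothetical illustration values.

def dominates(a, b):
    """True if a dominates b: no worse on both objectives and strictly
    better on at least one (fewer features, higher accuracy)."""
    return (a[0] <= b[0] and a[1] >= b[1]) and (a[0] < b[0] or a[1] > b[1])

def pareto_front(candidates):
    """Return the candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o != c)]

candidates = [(5, 0.91), (3, 0.89), (8, 0.92), (3, 0.85), (10, 0.90)]
front = pareto_front(candidates)
print(sorted(front))  # → [(3, 0.89), (5, 0.91), (8, 0.92)]
```

A full NSGA-II adds crowding-distance sorting and genetic variation on top of this non-dominance test, but the test itself is what defines the returned trade-off front.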
Multi-modality machine learning predicting Parkinson's disease
Personalized medicine promises individualized disease prediction and treatment. The convergence of machine learning (ML) and available multimodal data is key moving forward. We build upon previous work to deliver multimodal predictions of Parkinson's disease (PD) risk and systematically develop a model using GenoML, an automated ML package, to make improved multi-omic predictions of PD, validated in an external cohort. We investigated top features, constructed hypothesis-free disease-relevant networks, and investigated drug-gene interactions. We performed automated ML on multimodal data from the Parkinson's Progression Markers Initiative (PPMI). After selecting the best-performing algorithm, all PPMI data were used to tune the selected model. The model was validated in the Parkinson's Disease Biomarker Program (PDBP) dataset. Our initial model showed an area under the curve (AUC) of 89.72% for the diagnosis of PD. The tuned model was then tested for validation on external data (PDBP, AUC 85.03%). Optimizing thresholds for classification increased the diagnosis prediction accuracy and other metrics. Finally, networks were built to identify gene communities specific to PD. Combining data modalities outperforms the single-biomarker paradigm. UPSIT and PRS contributed most to the predictive power of the model, but their accuracy is supplemented by many smaller-effect transcripts and risk SNPs. Our model is best suited to identifying large groups of individuals to monitor within a health registry or biobank in order to prioritize them for further testing. This approach allows complex predictive models to be reproducible and accessible to the community, with the package, code, and results publicly available
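The threshold-optimization step the abstract mentions — choosing a classification cut-off on predicted probabilities rather than defaulting to 0.5 — can be sketched with Youden's J statistic, one common criterion for this. This is a hedged illustration, not the GenoML pipeline; the scores and labels are hypothetical.

```python
# Minimal sketch of classification-threshold tuning: scan candidate cut-offs
# over predicted probabilities and pick the one maximizing Youden's J
# (sensitivity + specificity - 1). Scores and labels are hypothetical.

def youden_j(scores, labels, threshold):
    """Sensitivity + specificity - 1 at a given probability threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 0)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens + spec - 1.0

def best_threshold(scores, labels):
    """Evaluate J at every observed score and return the best cut-off."""
    return max(set(scores), key=lambda t: youden_j(scores, labels, t))

scores = [0.95, 0.80, 0.70, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    0,    0,    0]
t = best_threshold(scores, labels)
print(f"best threshold {t}, J = {youden_j(scores, labels, t):.2f}")
```

In practice the threshold would be chosen on a tuning split and then applied unchanged to the external validation cohort, mirroring the PPMI-to-PDBP design described above.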
Machine Learning for Multiclass Classification and Prediction of Alzheimer's Disease
Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a common form of dementia. This research aims to develop machine learning algorithms that diagnose and predict the progression of AD from multimodal heterogeneous biomarkers, with a focus placed on early diagnosis. To meet this goal, several machine learning-based methods with their unique characteristics for feature extraction and automated classification, prediction, and visualization have been developed to discern subtle progression trends and predict the trajectory of disease progression.
The methodology envisioned aims to enhance both the multiclass classification accuracy and the prediction outcomes by effectively modeling the interplay between the multimodal biomarkers, handling the missing-data challenge, and adequately extracting all the relevant features that will be fed into the machine learning framework, all in order to understand the subtle changes that happen in the different stages of the disease. This research will also investigate the notion of multitasking to discover how the two processes of multiclass classification and prediction relate to one another in terms of the features they share, and whether they could learn from one another to optimize multiclass classification and prediction accuracy.
This research work also delves into predicting cognitive scores of specific tests over time, using multimodal longitudinal data. The intent is to augment our prospects for analyzing the interplay between the different multimodal features in the input space and the predicted cognitive scores. Moreover, the power of modality fusion, kernelization, and tensorization has also been investigated to efficiently extract important features hidden in the lower-dimensional feature space without being distracted by those deemed irrelevant.
With the adage that a picture is worth a thousand words, this dissertation introduces a unique color-coded visualization system with a fully integrated machine learning model for the enhanced diagnosis and prognosis of Alzheimer's disease. The incentive here is to show that through visualization, the challenges imposed by both the variability and interrelatedness of the multimodal features could be overcome. Ultimately, this form of visualization via machine learning informs on the challenges faced with multiclass classification and adds insight into the decision-making process for a diagnosis and prognosis
Identification of Parkinson's disease PACE subtypes and repurposing treatments through integrative analyses of multimodal data.
Parkinson's disease (PD) is a serious neurodegenerative disorder marked by significant clinical and progression heterogeneity. This study aimed to address the heterogeneity of PD through integrative analysis of various data modalities. We analyzed clinical progression data (≥5 years) of individuals with de novo PD using machine learning and deep learning to characterize individuals' phenotypic progression trajectories for PD subtyping. We discovered three pace subtypes of PD exhibiting distinct progression patterns: the Inching Pace subtype (PD-I), with mild baseline severity and mild progression speed; the Moderate Pace subtype (PD-M), with mild baseline severity but advancing at a moderate progression rate; and the Rapid Pace subtype (PD-R), with the most rapid symptom progression rate. We found the cerebrospinal fluid P-tau/α-synuclein ratio and atrophy in certain brain regions to be potential markers of these subtypes. Analyses of genetic and transcriptomic profiles with network-based approaches identified molecular modules associated with each subtype. For instance, the PD-R-specific module suggested STAT3, FYN, BECN1, APOA1, NEDD4, and GATA2 as potential driver genes of PD-R. It also suggested neuroinflammation, oxidative stress, metabolism, PI3K/AKT, and angiogenesis pathways as potential drivers of rapid PD progression (i.e., PD-R). Moreover, we identified repurposable drug candidates targeting these subtype-specific molecular modules using a network-based approach and cell line drug-gene signature data. We further estimated their treatment effects using two large-scale real-world patient databases; the real-world evidence we gained highlighted the potential of metformin in ameliorating PD progression. In conclusion, this work helps to better understand the clinical and pathophysiological complexity of PD progression and to accelerate precision medicine
A comparative study of model-centric and data-centric approaches in the development of cardiovascular disease risk prediction models in the UK Biobank
Aims
A diverse set of factors influence cardiovascular diseases (CVDs), but a systematic investigation of the interplay between these determinants and the contribution of each to CVD incidence prediction is largely missing from the literature. In this study, we leverage one of the most comprehensive biobanks worldwide, the UK Biobank, to investigate the contribution of different risk factor categories to more accurate incidence predictions in the overall population, by sex, different age groups, and ethnicity.
Methods and results
The investigated categories include the history of medical events, behavioural factors, socioeconomic factors, environmental factors, and measurements. We included data from a cohort of 405 257 participants aged 37–73 years and trained various machine learning and deep learning models on different subsets of risk factors to predict CVD incidence. Each model was trained on the complete set of predictors and on subsets where each category in turn was excluded. The results were benchmarked against QRISK3. The findings highlight that (i) leveraging a more comprehensive medical history substantially improves model performance: relative to QRISK3, the best-performing models improved discrimination by 3.78% and precision by 1.80%; (ii) both model- and data-centric approaches are necessary to improve predictive performance: the benefits of using a comprehensive history of diseases were far more pronounced when a neural sequence model, BEHRT, was used, highlighting the importance of the temporality of medical events that existing clinical risk models fail to capture; and (iii) besides the history of diseases, socioeconomic factors and measurements made small but significant independent contributions to the predictive performance.
Conclusion
These findings emphasize the need for considering broad determinants and novel modelling approaches to enhance CVD incidence prediction
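The leave-one-category-out design described in the Methods — training on the complete predictor set and on subsets with each risk-factor category excluded, then comparing performance — can be sketched as a simple ablation loop. The scoring function and its numbers below are hypothetical stand-ins for a real train-and-validate step (which in the study used models such as BEHRT evaluated on discrimination metrics).

```python
# Minimal sketch of category ablation: score the full predictor set, then
# score with each category held out and report the performance drop.
# evaluate() is a hypothetical stand-in for training and validating a model;
# the per-category weights are illustrative, not results from the study.

def evaluate(feature_categories):
    """Hypothetical scoring function standing in for train-and-validate."""
    weights = {"medical_history": 0.04, "behavioural": 0.01,
               "socioeconomic": 0.015, "environmental": 0.005,
               "measurements": 0.02}
    return 0.70 + sum(weights[c] for c in feature_categories)

categories = ["medical_history", "behavioural", "socioeconomic",
              "environmental", "measurements"]

full_score = evaluate(categories)
for held_out in categories:
    subset = [c for c in categories if c != held_out]
    drop = full_score - evaluate(subset)
    print(f"without {held_out}: performance drop = {drop:.3f}")
```

The category whose removal causes the largest drop carries the most independent predictive signal, which is how the study attributes importance to medical history versus the other determinant groups.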
IoT in smart communities, technologies and applications.
The Internet of Things (IoT) is a system that integrates different devices and technologies, removing the need for human intervention, and thereby enables smart (or smarter) cities around the world. By hosting different technologies and allowing interactions between them, the IoT has spearheaded the development of smart city systems for sustainable living and increased comfort and productivity for citizens. The IoT for smart cities spans many different domains and draws upon various underlying systems for its operation. In this work, we provide a holistic coverage of the Internet of Things in smart cities by discussing the fundamental components that make up the IoT smart city landscape, the technologies that enable these domains to exist, the most prevalent practices and techniques used in these domains, and the challenges that the deployment of IoT systems for smart cities encounters and that need to be addressed for ubiquitous use of smart city applications. It also presents a coverage of optimization methods and applications from a smart city perspective enabled by the Internet of Things. Towards this end, a mapping is provided for the most encountered applications of computational optimization within IoT smart cities for five popular optimization methods: ant colony optimization, genetic algorithms, particle swarm optimization, artificial bee colony optimization, and differential evolution. For each application identified, the algorithms used, the objectives considered, and the nature of the formulation and constraints taken into account are specified and discussed. Lastly, the data setup used by each covered work is mentioned and directions for future work are identified. Within the smart health domain of IoT smart cities, human activity recognition has been a key study topic in the development of cyber-physical systems and assisted living applications.
In particular, inertial sensor-based systems have become increasingly popular because they do not restrict users' movement and are relatively simple to implement compared with other approaches. Fall detection is one of the most important tasks in human activity recognition. With an increasingly aging world population and an inclination among the elderly to live alone, the need to incorporate dependable fall detection schemes into smart devices such as phones and watches has gained momentum. Therefore, differentiating between falls and activities of daily living (ADLs) has been the focus of researchers in recent years, with very good results. However, one aspect of fall detection that has not been investigated much is direction- and severity-aware fall detection. Since a fall detection system aims to detect falls in people and notify medical personnel, it could be of added value for health professionals tending to a patient who has suffered a fall to know the nature of the accident. In this regard, as a case study for smart health, four different experiments have been conducted for the task of fall detection with direction and severity consideration on two publicly available datasets. These experiments not only tackle the problem at increasingly complicated levels (the first considers a fall-only scenario and the others a combined activity-of-daily-living and fall scenario) but also present methodologies that outperform state-of-the-art techniques, as discussed. Lastly, recommendations for future research are also provided
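A common baseline for inertial-sensor fall detection — distinct from the thesis's direction- and severity-aware methods, and shown here only as a hedged sketch — thresholds the signal-vector magnitude of the accelerometer: a fall typically shows a free-fall dip followed by an impact spike. The thresholds and samples below are hypothetical.

```python
# Minimal sketch of threshold-based fall detection on tri-axial accelerometer
# readings (in g). A fall is flagged when a free-fall dip in signal-vector
# magnitude is followed by an impact spike. Thresholds are hypothetical.

import math

def svm(sample):
    """Signal-vector magnitude of one (ax, ay, az) reading."""
    return math.sqrt(sum(a * a for a in sample))

def detect_fall(window, impact_g=2.5, freefall_g=0.5):
    """Flag a fall if a sub-freefall_g dip precedes a super-impact_g spike."""
    mags = [svm(s) for s in window]
    for i, m in enumerate(mags):
        if m < freefall_g and any(x > impact_g for x in mags[i:]):
            return True
    return False

adl = [(0.0, 0.0, 1.0)] * 10                      # standing still, ~1 g
fall = [(0.0, 0.0, 1.0), (0.1, 0.1, 0.2),         # free-fall phase
        (1.8, 1.5, 2.0), (0.0, 0.0, 1.0)]         # impact, then at rest
print(detect_fall(adl), detect_fall(fall))        # → False True
```

Direction and severity awareness would additionally use which axes carry the impact and how far the magnitude exceeds the threshold, which is the extension the case study above investigates with learned models rather than fixed rules.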
ACO-based feature selection algorithm for classification
A dataset with a small number of records but a large number of attributes exemplifies a phenomenon called the "curse of dimensionality". The classification of this type of dataset requires Feature Selection (FS) methods for the extraction of useful information. The modified graph clustering ant colony optimisation (MGCACO) algorithm is an effective FS method that was developed based on grouping highly correlated features. However, the MGCACO algorithm has three main drawbacks in producing a feature subset: its clustering method, its parameter sensitivity, and its final subset determination. An enhanced graph clustering ant colony optimisation (EGCACO) algorithm is proposed to solve these three (3) problems of the MGCACO algorithm. The proposed improvements include: (i) an ACO feature clustering method to obtain clusters of highly correlated features; (ii) an adaptive selection technique for subset construction from the clusters of features; and (iii) a genetic-based method for producing the final subset of features. The ACO feature clustering method utilises mechanisms such as intensification and diversification for local and global optimisation to provide highly correlated features. The adaptive selection technique enables the parameter to change adaptively based on feedback from the search space. The genetic method determines the final subset automatically, based on crossover and subset quality calculation. The performance of the proposed algorithm was evaluated on 18 benchmark datasets from the University of California, Irvine (UCI) repository and nine (9) deoxyribonucleic acid (DNA) microarray datasets against 15 benchmark metaheuristic algorithms. The experimental results of the EGCACO algorithm on the UCI datasets are superior to the other benchmark optimisation algorithms in terms of the number of selected features for 16 of the 18 UCI datasets (88.89%), and the best in eight (8) of the datasets (44.47%) for classification accuracy.
Further, experiments on the nine (9) DNA microarray datasets showed that the EGCACO algorithm is superior to the benchmark algorithms in terms of classification accuracy (first rank) for seven (7) datasets (77.78%) and yields the lowest number of selected features in six (6) datasets (66.67%). The proposed EGCACO algorithm can be utilised for FS in DNA microarray classification tasks that involve large dataset sizes in various application domains
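The feature-grouping idea underlying both MGCACO and EGCACO — clustering highly correlated features before subset construction — can be sketched with Pearson correlation and a greedy single-pass grouping. The greedy pass is an illustrative simplification, not the ACO clustering the thesis proposes; the data are hypothetical.

```python
# Minimal sketch of grouping highly correlated features: compute pairwise
# Pearson correlation and assign each feature to the first existing cluster
# whose representative it correlates with above a threshold. Hypothetical
# data; a greedy simplification of graph-clustering-based FS.

import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cluster_features(features, threshold=0.9):
    """Greedily group features whose |correlation| exceeds the threshold."""
    clusters = []
    for name, values in features.items():
        for cl in clusters:
            if abs(pearson(values, features[cl[0]])) >= threshold:
                cl.append(name)
                break
        else:
            clusters.append([name])
    return clusters

features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.1, 4.0, 6.2, 8.1],   # nearly proportional to f1
    "f3": [4.0, 1.0, 3.5, 0.5],   # weakly correlated with f1
}
print(cluster_features(features))  # → [['f1', 'f2'], ['f3']]
```

A subset-construction step would then pick one representative per cluster, which is how clustering reduces redundancy among selected features.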
Electronic Health Record-Derived Phenotyping Models to Improve Genomic Research in Stroke
Stroke is a highly heterogeneous and complex disease that is a leading cause of death in the United States. The landscape of risk factors for stroke is vast, and its large genetic burden has yet to be fully discovered. We hypothesize that the small number of stroke variants recovered so far is due to 1) the vast phenotypic heterogeneity of stroke and 2) binary labeling of stroke genome-wide association study (GWAS) participants as cases or controls. Specifically, genome-wide association studies accumulate hundreds of thousands to millions of participants to acquire adequate signal for variant discovery. This requires time-consuming manual curation of cases and controls often involving large-scale collaborations. Genetic biobanks connected to electronic health records (EHR) can facilitate these studies by using data routinely captured during clinical care like billing diagnosis codes. These data, however, do not define adjudicated cases and controls, with many patients falling somewhere in between. There is an opportunity to use machine learning to add nuance to these definitions. We hypothesize that an expanded definition of disease by incorporating correlated diseases and risk factors from EHR data will improve GWAS power. We also hypothesize that granularly subtyping stroke using unsupervised learning methods can provide insight into stroke etiology and heterogeneity. In Chapter 1, we described the motivation for building upon current phenotyping methods for subtyping and genome-wide association studies to improve GWAS power. In Chapter 2, using patients from Columbia-New York Presbyterian (NYP) Hospital, we built and evaluated machine learning models to identify patients with acute ischemic stroke based on 75 different case-control and classifier combinations. 
In Chapter 3, we compared two data-driven, unsupervised methods, non-negative matrix factorization (NMF) and Hierarchical Poisson Factorization, to subtype stroke patients and determined whether any of the subtypes correlate with stroke severity. In Chapter 4, we estimated the heritability of acute ischemic stroke by treating the patient probabilities assigned by the machine learning phenotyping models for acute ischemic stroke in Chapter 2 as a quantitative trait and mapping the probabilities to Columbia-NYP EHR-generated pedigrees. We also applied our machine learning phenotyping algorithm, which we call QTPhenProxy, to venous thromboembolism in Columbia eMERGE Consortium patients and ran a genome-wide association study using the model probabilities as a quantitative trait. Finally, we applied QTPhenProxy to subjects in the UK Biobank for stroke and 14 other diseases and ran genome-wide association studies for each disease. We found that our machine-learned models performed well in identifying acute ischemic stroke patients in the Columbia-NYP EHR and in the UK Biobank. We also found some NMF-derived subtypes that were significantly correlated with stroke severity. We were underpowered in the eMERGE venous thromboembolism cohort GWAS and did not recover any known or new variants. Finally, we found that QTPhenProxy improved the power of GWAS of stroke and several subtypes in the UK Biobank, recovered known variants, and discovered a new variant that replicated in a previous stroke GWAS. Our results for QTPhenProxy demonstrate the promise of incorporating large but messy data sources, such as the electronic health record, to improve signal in genome-wide association studies
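The quantitative-trait idea behind QTPhenProxy — replacing binary case/control labels with a model's predicted disease probability and testing each variant against that continuous proxy — can be sketched as a per-variant regression. The simple least-squares slope below is a hedged simplification of a real GWAS association model (which would include covariates and significance testing); all numbers are hypothetical.

```python
# Minimal sketch of a quantitative-trait association test: regress a model's
# predicted disease probability (the phenotype proxy) on genotype dosage
# (0/1/2 allele counts). A simplification of a real GWAS model; all data
# below are hypothetical.

def regress(dosage, proxy):
    """Least-squares slope of the phenotype proxy on allele dosage."""
    n = len(dosage)
    mx, my = sum(dosage) / n, sum(proxy) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosage, proxy))
    sxx = sum((x - mx) ** 2 for x in dosage)
    return sxy / sxx

dosage = [0, 0, 1, 1, 2, 2, 0, 2]                           # per subject
proxy = [0.05, 0.10, 0.40, 0.35, 0.80, 0.75, 0.15, 0.90]    # model P(disease)
beta = regress(dosage, proxy)
print(f"per-allele effect on predicted risk: {beta:.3f}")
```

Because the proxy is continuous, subjects who fall "somewhere in between" case and control still contribute graded signal, which is the source of the power gain the abstract reports.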
UNCOVERING PATTERNS IN COMPLEX DATA WITH RESERVOIR COMPUTING AND NETWORK ANALYTICS: A DYNAMICAL SYSTEMS APPROACH
In this thesis, we explore methods of uncovering underlying patterns in complex data, and making predictions, through machine learning and network science.
With the availability of more data, machine learning for data analysis has advanced rapidly. However, there is a general lack of approaches that might allow us to 'open the black box'. In the machine learning part of this thesis, we primarily use an architecture called Reservoir Computing for time-series prediction and image classification, while exploring how information is encoded in the reservoir dynamics.
First, we investigate the ways in which a Reservoir Computer (RC) learns concepts such as 'similar' and 'different', and relationships such as 'blurring', 'rotation' etc. between image pairs, and generalizes these concepts to different classes unseen during training. We observe that the high dimensional reservoir dynamics display different patterns for different relationships. This clustering allows RCs to perform significantly better in generalization with limited training compared with state-of-the-art pair-based convolutional/deep Siamese Neural Networks.
Second, we demonstrate the utility of an RC in the separation of superimposed chaotic signals. We assume no knowledge of the dynamical equations that produce the signals, and require only that the training data consist of finite time samples of the component signals. We find that our method significantly outperforms the optimal linear solution to the separation problem, the Wiener filter.
To understand how representations of signals are encoded in an RC during learning, we study its dynamical properties when trained to predict chaotic Lorenz signals. We do so by using a novel, mathematical fixed-point-finding technique called directional fibers. We find that, after training, the high dimensional RC dynamics includes fixed points that map to the known Lorenz fixed points, but the RC also has spurious fixed points, which are relevant to how its predictions break down.
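The reservoir dynamics described in the paragraphs above can be sketched compactly: the state update x(t+1) = tanh(W x(t) + w_in u(t)), and the "echo state" property that makes training the readout feasible — with modestly scaled recurrent weights, two different initial states driven by the same input converge, so the state reflects the input history rather than initial conditions. This is a hedged toy demonstration; the sizes, weight scaling, and input signal are hypothetical, not the configurations used in the thesis.

```python
# Minimal sketch of a reservoir (echo state network) state update and the
# echo-state property: two different initial states, driven by the same
# input, converge when the recurrent weights are small enough to contract.
# Reservoir size, weight scales, and the input signal are hypothetical.

import math
import random

random.seed(0)
N = 20                                    # reservoir size
W = [[random.uniform(-0.1, 0.1) for _ in range(N)] for _ in range(N)]
w_in = [random.uniform(-0.5, 0.5) for _ in range(N)]

def step(state, u):
    """One reservoir update: x(t+1) = tanh(W x(t) + w_in u(t))."""
    return [math.tanh(sum(W[i][j] * state[j] for j in range(N)) + w_in[i] * u)
            for i in range(N)]

x_a = [random.uniform(-1, 1) for _ in range(N)]
x_b = [random.uniform(-1, 1) for _ in range(N)]
for t in range(100):
    u = math.sin(0.3 * t)                 # shared driving input
    x_a, x_b = step(x_a, u), step(x_b, u)

gap = max(abs(a - b) for a, b in zip(x_a, x_b))
print(f"state gap after washout: {gap:.2e}")   # near zero: states echo input
```

In a full RC, only a linear readout from this state is trained, which is why the fixed points and clustering of the reservoir dynamics discussed above determine what the model can represent and where its predictions break down.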
While machine learning is a useful data processing tool, its success often relies on a useful representation of the system's information. In contrast, systems with a large number of interacting components may be better analyzed by modeling them as networks. While numerous advances in network science have helped us analyze such systems, tools that identify properties of networks modeling multi-variate time-evolving data (such as disease data) are limited. We close this gap by introducing a novel data-driven, network-based Trajectory Profile Clustering (TPC) algorithm for 1) identification of disease subtypes and 2) early prediction of subtype/disease progression patterns. TPC identifies subtypes by clustering patients with similar disease trajectory profiles derived from bipartite patient-variable networks. Applying TPC to a Parkinson's dataset, we identify 3 distinct subtypes. Additionally, we show that TPC predicts disease subtype 4 years in advance with 74% accuracy