370 research outputs found

    Big Data Analytics for Complex Systems

    The evolution of technology in all fields has led modern systems to generate vast amounts of data. Using data to extract information, make predictions, and make decisions is the current trend in artificial intelligence. Advances in big data analytics tools have made accessing and storing data easier and faster than ever, and machine learning algorithms help to identify patterns in data and extract information from it. Current tools and machines in health, computer technologies, and manufacturing can generate massive amounts of raw data about their products or samples. The author of this work proposes a modern integrative system that uses big data analytics, machine learning, supercomputer resources, and measurements from industrial health machines to build a smart system that can mimic the human skills of observation, detection, prediction, and decision-making. Applications of the proposed smart systems are included as case studies to highlight the contributions of each system. The first contribution is the ability to use big data and deep learning technologies on production lines to diagnose incidents and take proper action. In the current era of digital industrial transformation, Industry 4.0 has been receiving researchers' attention because it can be used to automate production-line decisions. Reconfigurable manufacturing systems (RMS) have been widely used to reduce the setup cost of restructuring production lines. However, current RMS modules are not linked to the cloud for online decision-making; these modules must connect to an online server (supercomputer) that has big data analytics and machine learning capabilities. Here, online means that data is centralized in the cloud (supercomputer) and accessible in real time. 
In this study, deep neural networks are used to detect the decisive features of a product and build a prediction model with which the iFactory makes the necessary decision for defective products. The Spark ecosystem is used to manage the access, processing, and storage of the streaming big data. This contribution is implemented as a closed cycle; to the best of our knowledge, no one in the literature has introduced big data analysis using deep learning in real-time manufacturing applications. The code shows a high accuracy of 97% for classifying normal versus defective items. The second contribution, in bioinformatics, is the ability to build supervised machine learning approaches based on patients' gene expression to predict the proper treatment for breast cancer. In the trial, to personalize treatment, the machine learns the genes that are active in the patient cohort with a five-year survival period. The initial condition here is that each group must undergo only one specific treatment. After learning about each group (or class), the machine can personalize the treatment of a new patient by examining the patient's gene expression. The proposed model will help in the diagnosis and treatment of the patient. Future work in this area involves building a protein-protein interaction network with the selected genes for each treatment, to first analyze the motives of the genes and then target them with the proper drug molecules. In the learning phase, a couple of feature-selection techniques and standard supervised classifiers are used to build the prediction model. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure close to 100%. The third contribution is the ability to build semi-supervised learning for breast cancer survival treatment that advances the second contribution. 
By understanding the relations between the classes, we can design the machine learning phase based on the similarities between classes. In the proposed research, the researcher used the Euclidean distance matrix among the survival treatment classes to build the hierarchical learning model. The distance information, learned through an unsupervised approach, helps the prediction model select classes that are far from each other, maximizing the distance between classes and yielding wider class groups. The performance of this approach shows a slight improvement over the second model. However, this model reduced the number of discriminative genes from 47 to 37. The model in the second contribution studies each class individually, while this model focuses on the relationships between the classes and uses this information in the learning phase. Hierarchical clustering is performed to draw the borders between groups of classes before building the classification models. Several distance measurements are tested to identify the best linkages between classes. Most of the nodes show high performance, with accuracy, sensitivity, specificity, and F-measure ranging from 90% to 100%. All the case study models showed high performance in the prediction phase. These modern models can be replicated for different problems within different domains. The comprehensive models of the newer technologies are reconfigurable and modular; any newer learning phase can be plugged in at either end of the learning phase. Therefore, the output of the system can be an input for another learning system, and a newer feature can be added to the input to be considered for the learning phase.
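
The idea of grouping treatment classes by the Euclidean distance between them can be illustrated with a small sketch. This is not the author's code: the class names, the two-gene toy data, the use of class centroids, and the single-linkage merge rule are all assumptions made for illustration (the abstract only says that several distance measurements and linkages were tested).

```python
import math
from itertools import combinations


def centroid(rows):
    """Mean expression vector of one class (rows = list of sample vectors)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_classes(class_samples, n_groups):
    """Merge classes bottom-up (single linkage on centroid distance)
    until only n_groups groups of classes remain."""
    cents = {name: centroid(rows) for name, rows in class_samples.items()}
    groups = [{name} for name in cents]
    while len(groups) > n_groups:
        # Pick the pair of groups whose closest members are nearest.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda p: min(
                euclidean(cents[a], cents[b])
                for a in groups[p[0]]
                for b in groups[p[1]]
            ),
        )
        groups[i] |= groups[j]  # i < j, so deleting j keeps i valid
        del groups[j]
    return groups


# Hypothetical treatment classes with two genes per sample (toy values).
toy = {
    "hormone": [[1.0, 1.1], [0.9, 1.0]],
    "hormone+radio": [[1.2, 1.0], [1.1, 1.2]],
    "chemo": [[5.0, 4.8], [5.2, 5.1]],
}
groups = group_classes(toy, n_groups=2)
print(sorted(sorted(g) for g in groups))
```

A classifier could then be trained first to separate the resulting groups and afterwards to separate the classes inside each group, which is one way to read the hierarchical learning model described above.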

    Identifying Network-Biomarkers of Breast Cancer Survivability

    One of the key challenges of breast cancer research is to predict whether a patient identified with a specific subtype or treated with a specific therapy is going to survive or die. Current studies find small subsets of gene biomarkers able to accurately predict the response to therapy. In these studies, the selected genes are not necessarily functionally related, and hence they may not correctly indicate the molecular mechanism behind breast cancer survivability. Also, several studies have shown that there is very low overlap between the biomarker subsets reported for the same cancer. To improve the robustness of classification performance and the stability of detected biomarkers, recent methods take existing knowledge on relations between genes into account in the classifier, by aggregating functionally related genes to produce discriminative gene subnetworks called network-biomarkers. In this paper, given a breast cancer dataset of patients with different subtypes treated with a given therapy drug, we devised a network-based machine learning approach that integrates a protein-protein interaction (PPI) network with gene expression data (1) to identify the network-biomarkers of breast cancer survivability a) based on subtypes and b) based on therapy, and (2) to predict the survivability of breast cancer patients a) based on subtypes and b) treated with a therapy drug. We used the concept of a seed gene to identify network-biomarkers at distances 2, 3 and 4 from the seed gene protein, and our method found that distances 3 and 4 give the best results for identifying the survivability of breast cancer patients based on subtype and therapy, respectively. To solve the class imbalance problem in some subtypes, we implemented ADASYN. 
We obtained the best classification performance using random forest, where the geometric mean, F1-measure and accuracy are respectively 0.867, 0.850 and 87.00% for the subtype-specific study, and 0.829, 0.807 and 83.77% for therapy specific
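
The three reported measures can be computed from a binary confusion matrix as in the sketch below. The abstract does not define its geometric mean; the code assumes the definition common in imbalanced-learning work, the square root of sensitivity times specificity. The counts are made up and are not the paper's results.

```python
import math


def imbalance_metrics(tp, fp, tn, fn):
    """Geometric mean, F1-measure and accuracy from confusion-matrix counts.

    Assumption: geometric mean = sqrt(sensitivity * specificity).
    """
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    precision = tp / (tp + fp)
    return {
        "g_mean": math.sqrt(sensitivity * specificity),
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Made-up counts, for illustration only.
m = imbalance_metrics(tp=40, fp=10, tn=45, fn=5)
print(m)
```

Unlike accuracy, the geometric mean drops sharply when either class is predicted poorly, which is why it is often reported alongside oversampling methods such as ADASYN.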

    Integration of Data Mining Classification Techniques and Ensemble Learning for Predicting the Type of Breast Cancer Recurrence

    Conservative surgery plus radiotherapy is an alternative to radical mastectomy in the early stages of breast cancer, presenting equivalent survival rates. Data mining helps to manage the data and provides useful information on the progression and treatment of cancerous conditions, as these methods can help to reduce the number of false positive and false negative decisions. Various machine learning techniques can be used to support doctors in effective and accurate decision making. In this paper, various classifiers have been tested for predicting the type of breast cancer recurrence, and the results show that neural networks outperform the others

    Machine Learning Approaches for Breast Cancer Survivability Prediction

    Breast cancer is one of the leading causes of cancer death in women. If not diagnosed early, the 5-year survival rate of patients is just about 26%. Furthermore, patients with similar phenotypes can respond differently to the same therapies, which means the therapies might not work well for some of them. Identifying biomarkers that can help predict a cancer class with high accuracy is at the heart of breast cancer studies because they are targets of treatments and drug development. Genomics data have been shown to carry useful information for breast cancer diagnosis and prognosis, as well as for uncovering the disease's mechanism. Machine learning methods are powerful tools to find such information. Feature selection methods are often used in supervised and unsupervised learning tasks to deal with data containing a large number of features, of which only a small portion are useful to the classification task. On the other hand, analyzing only one type of data, without reference to the existing knowledge about the disease and the therapies, might lead to misleading findings. Effective data integration approaches are necessary to understand this complex disease. In this thesis, we apply and develop machine learning methods to identify meaningful biomarkers for breast cancer survivability prediction after a certain treatment. They include applying feature selection methods to gene-expression data to derive gene signatures, where the initial genes are collected based on the mechanism of some drugs used in breast cancer therapies. We also propose a new feature selection method, named PAFS, and apply it to discover accurate biomarkers. In addition, there is increasing support for the view that sub-network biomarkers are more robust and accurate than gene biomarkers. We propose two network-based approaches to identify sub-network biomarkers for breast cancer survivability prediction after a treatment. 
They integrate gene-expression data with protein-protein interactions during the optimal sub-network searching process and use cancer-related genes and pathways to prioritize the extracted sub-networks. The sub-network search space is usually huge and many proteins interact with thousands of other proteins. Thus, we apply some heuristics to avoid generating and evaluating redundant sub-networks

    Gene expression profiling of breast cancer survivability by pooled cDNA microarray analysis using logistic regression, artificial neural networks and decision trees

    BACKGROUND: Microarray technology can acquire information about thousands of genes simultaneously. We analyzed published breast cancer microarray databases to predict five-year recurrence and compared the performance of three data mining algorithms, artificial neural networks (ANN), decision trees (DT) and logistic regression (LR), and two composite models, DT-ANN and DT-LR. From the collection of microarray datasets in the Gene Expression Omnibus, four breast cancer datasets were pooled for predicting five-year breast cancer relapse. After data compilation, 757 subjects, 5 clinical variables and 13,452 genetic variables were aggregated. The bootstrap method, the Mann–Whitney U test and 20-fold cross-validation were used to identify the candidate genes with the 100 most significant p-values. The predictive powers of the DT, LR and ANN models were assessed using accuracy and the area under the ROC curve. The associated genes were evaluated using Cox regression. RESULTS: The DT models exhibited the lowest predictive power and the poorest extrapolation when applied to the test samples. The ANN models displayed the best predictive power and showed the best extrapolation. The 21 most-associated genes, as determined by integration of each model, were analyzed using Cox regression with a 3.53-fold (95% CI: 2.24-5.58) increased risk of breast cancer five-year recurrence… CONCLUSIONS: The 21 selected genes can predict breast cancer recurrence. Among these genes, CCNB1, PLK1 and TOP2A are in the cell cycle G2/M DNA damage checkpoint pathway. Oncologists who understand the gene expression profiles of breast cancer recurrence can offer this genetic information to patients
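
The gene-screening step can be sketched as ranking genes by a Mann–Whitney U test between relapsed and non-relapsed subjects. This is a simplified illustration, not the study's pipeline: it omits the bootstrap and cross-validation, uses the normal approximation without tie or continuity correction, and the gene names and expression values are hypothetical.

```python
import math


def mann_whitney_p(x, y):
    """Two-sided p-value for the Mann-Whitney U test.

    Uses the normal approximation with no tie or continuity correction,
    which is only a rough guide for very small samples.
    """
    n1, n2 = len(x), len(y)
    # U counts the pairs where a value from x exceeds one from y (ties = 0.5).
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


def top_genes(expr, relapsed, k):
    """Return the k genes with the smallest p-values.

    expr: gene name -> expression values, one per subject
    relapsed: booleans, one per subject, in the same order
    """
    pvals = {}
    for gene, values in expr.items():
        pos = [v for v, r in zip(values, relapsed) if r]
        neg = [v for v, r in zip(values, relapsed) if not r]
        pvals[gene] = mann_whitney_p(pos, neg)
    return sorted(pvals, key=pvals.get)[:k]


# Hypothetical data: 4 relapsed subjects followed by 4 non-relapsed ones.
relapsed = [True] * 4 + [False] * 4
expr = {
    "GENE_A": [5, 6, 7, 8, 1, 2, 3, 4],  # clearly shifted between groups
    "GENE_B": [1, 5, 2, 6, 3, 7, 4, 8],  # heavily overlapping groups
}
selected = top_genes(expr, relapsed, k=1)
print(selected)
```

In practice a tested implementation such as `scipy.stats.mannwhitneyu` would be preferable; the hand-written version here only keeps the sketch free of dependencies.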

    iSOM-GSN: An Integrative Approach for Transforming Multi-omic Data into Gene Similarity Networks via Self-organizing Maps

    Deep learning models are currently applied in diverse domains, including image recognition, text generation, and event prediction. With the advent of new high-throughput sequencing technologies, a multitude of genomic data has been generated and made available. The representation of such data using deep neural networks, or for that matter the application of differential analysis, has, however, not been able to match the growth of that data. One of the main challenges in applying convolutional neural networks to gene interaction data is the lack of understanding of the vector space domain to which they belong, along with the inherent difficulties involved in representing those interactions in a significantly lower dimension, viz. Euclidean spaces. These challenges become more prevalent when dealing with various types of omics data with different forms. In this regard, we introduce a systematic and generalized method, called iSOM-GSN, used to transform higher-dimensional multi-omic genomic data into a two-dimensional grid. Afterwards, we apply a convolutional neural network (CNN) to predict disease states of various types. Based on the idea of Kohonen's self-organizing map (SOM), we generate a two-dimensional grid for each sample for a given set of genes that represent a gene similarity network (GSN). The genes that are significantly highly mutated across the whole genome are related to each other based on functional interactions. We then test the model to predict breast and prostate cancer stages using gene expression, DNA methylation, and copy number alteration, yielding accuracies in the 94-98% range for tumor stages of breast cancer and calculated Gleason scores of prostate cancer with just 14 input genes for both cases. To our knowledge, this is the first attempt to use self-organizing maps and convolutional neural networks on integrating high-dimensional multi-omics data. 
The scheme not only outputs nearly perfect classification accuracy, but also provides an enhanced scheme for visualization, dimensionality reduction, and interpretation of the results

    Meta-Analysis Reveals Both the Promises and the Challenges of Clinical Metabolomics

    Clinical metabolomics is a rapidly expanding field focused on identifying molecular biomarkers to aid in the efficient diagnosis and treatment of human diseases. Variations in study design, metabolomics methodologies, and investigator protocols raise serious concerns about the accuracy and reproducibility of these potential biomarkers. The explosive growth of the field has led to the recent availability of numerous replicate clinical studies, which permits an evaluation of the consistency of biomarkers identified across multiple metabolomics projects. Pancreatic ductal adenocarcinoma (PDAC) is the third-leading cause of cancer-related death and has the lowest five-year survival rate primarily due to the lack of an early diagnosis and the limited treatment options. Accordingly, PDAC has been a popular target of clinical metabolomics studies. We compiled 24 PDAC metabolomics studies from the scientific literature for a detailed meta-analysis. A consistent identification across these multiple studies allowed for the validation of potential clinical biomarkers of PDAC while also highlighting variations in study protocols that may explain poor reproducibility. Our meta-analysis identified 10 metabolites that may serve as PDAC biomarkers and warrant further investigation. However, 87% of the 655 metabolites identified as potential biomarkers were identified in single studies. Differences in cohort size and demographics, p-value choice, fold-change significance, sample type, handling and storage, data collection, and analysis were all factors that likely contributed to this apparently large false positive rate. Our meta-analysis demonstrated the need for consistent experimental design and normalized practices to accurately leverage clinical metabolomics data for reliable and reproducible biomarker discovery
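
The consistency check at the heart of such a meta-analysis amounts to counting how many studies report each metabolite. The sketch below shows that tally on placeholder data; the study and metabolite names are hypothetical and do not come from the 24 PDAC studies.

```python
from collections import Counter


def recurrence_summary(studies):
    """Count how many studies report each metabolite.

    studies: study name -> list of metabolites reported as biomarkers
    Returns (counts, fraction of metabolites reported by only one study).
    """
    # set() ensures a metabolite listed twice in one study is counted once.
    counts = Counter(m for reported in studies.values() for m in set(reported))
    single = [m for m, c in counts.items() if c == 1]
    return counts, len(single) / len(counts)


# Placeholder studies and metabolite labels, for illustration only.
studies = {
    "study_1": ["m_A", "m_B", "m_C"],
    "study_2": ["m_A", "m_D"],
    "study_3": ["m_A", "m_B", "m_E"],
}
counts, single_fraction = recurrence_summary(studies)
print(counts.most_common(2), single_fraction)
```

Applied to the real data, the second return value corresponds to the reported share of metabolites identified in single studies (87% of 655).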

    A Deep Learning Approach for Multi-Omics Data Integration to Diagnose Early-Onset Colorectal Cancer

    Colorectal cancer is one of the most common cancers and is a leading cause of death worldwide. It starts in the colon or the rectum; colon and rectal cancers are often grouped together because they have many features in common. It has been noticed that colorectal cancer has lately been affecting young-onset patients, those less than 50 years of age, at increasing rates. Rapid developments in omics technologies have made them highly regarded in biomedical research for the early detection of cancer. Omics data reveal how different molecules and clinical features work together in disease progression. However, omics data sources are heterogeneous in nature and require careful preprocessing to be integrated. A convolutional neural network is a class of deep neural network commonly applied to analyze visual imagery. In this thesis, we propose a model that converts one-dimensional omics vectors into RGB images to be integrated into the hidden layers of the convolutional neural network. The prediction model allows all the different omics to contribute to decision making by extracting the hidden interactions among them. These subsets of interacting omics can serve as potential biomarkers for young-onset colorectal cancer
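
One way to picture the vector-to-image conversion is to assign one omics type to each color channel and lay the features out on a square pixel grid. The abstract does not describe the actual mapping, so everything below is an assumption: the channel assignment, min-max scaling to 0-255, row-major layout, black padding, and the toy values.

```python
import math


def scale_to_bytes(vec):
    """Min-max scale a numeric vector to integers in 0..255."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [0] * len(vec)  # constant vector carries no contrast
    return [round(255 * (v - lo) / (hi - lo)) for v in vec]


def omics_to_rgb(red, green, blue):
    """Pack three equal-length omics vectors into a square RGB grid.

    Feature i of each vector becomes one channel of pixel i (row-major);
    unused pixels at the end are padded with black.
    """
    if not red or not (len(red) == len(green) == len(blue)):
        raise ValueError("need three non-empty vectors of equal length")
    side = math.isqrt(len(red) - 1) + 1  # ceil(sqrt(n)) for n >= 1
    channels = [scale_to_bytes(c) for c in (red, green, blue)]
    pixels = list(zip(*channels))
    pixels += [(0, 0, 0)] * (side * side - len(pixels))
    return [pixels[r * side:(r + 1) * side] for r in range(side)]


# Toy sample with 5 features per omics type (hypothetical values).
expression = [0.0, 1.0, 2.0, 3.0, 4.0]
methylation = [0.9, 0.1, 0.5, 0.3, 0.7]
copy_number = [2, 2, 3, 1, 2]
image = omics_to_rgb(expression, methylation, copy_number)
print(len(image), len(image[0]))
```

The resulting nested list has the height x width x 3 shape that image-oriented CNN libraries expect once converted to an array.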