137 research outputs found

    Performance and Competitiveness of Tree-Based Pipeline Optimization Tool

    Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Automated machine learning (AutoML) is the process of automating the entire machine learning workflow when applied to real-world problems. AutoML can increase data science productivity while maintaining performance and accuracy, allowing non-experts to use complex machine learning methods. The Tree-based Pipeline Optimization Tool (TPOT) was one of the first AutoML methods created by data scientists and is designed to optimize machine learning pipelines using genetic programming. While still under active development, TPOT is a very promising AutoML tool. This thesis explores the algorithm and analyses its performance on real-world data. Results show that evolution-based optimization is at least as accurate as TPOT's initialization; the effectiveness of the genetic operators, however, depends on the nature of the test case.
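
    As a concrete illustration of the genetic-programming-based pipeline optimization described above, the following minimal sketch shows how TPOT is typically invoked through its scikit-learn-style interface. The dataset, generation count and population size are illustrative placeholders, not the settings evaluated in the thesis.

```python
# Minimal sketch of a typical TPOT run; the dataset, generation count and
# population size below are illustrative placeholders, not the settings
# evaluated in the thesis.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Genetic programming evolves a population of candidate pipelines:
# `population_size` pipelines are created at initialization and refined
# over `generations` rounds of crossover and mutation.
tpot = TPOTClassifier(generations=5, population_size=20,
                      cv=5, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)

print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # best evolved pipeline as plain scikit-learn code
```

    The exported script contains the winning pipeline as ordinary scikit-learn code, which makes the evolved result easy to inspect and compare against the randomly initialized pipelines.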

    Data- and expert-driven feature selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional, heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters, helping to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e., the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating the results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. If systematically integrated into the planning of patient treatment, such information on feature block importance can save human, technical, and financial resources by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection.
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of the results, which carries high potential for better data-driven decision support in clinical practice.
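
    To make the ensemble feature selection idea concrete, the sketch below approximates RENT-style repeated elastic-net selection using only scikit-learn: the selector is refit on repeated subsamples and a feature is kept if it is selected in at least a fraction tau of the repeats. This is a simplified illustration of the underlying idea, not the API of the released RENT or UBayFS packages, and all parameter values are placeholders.

```python
# Simplified sketch of RENT-style ensemble feature selection using only
# scikit-learn; the released RENT package applies additional selection
# criteria, so this illustrates the repeated elastic-net idea rather than
# the published implementation. All parameter values are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

def ensemble_elastic_net_selection(X, y, n_repeats=50, subsample=0.8,
                                   tau=0.8, C=1.0, l1_ratio=0.5, seed=0):
    """Return indices of features selected in at least a fraction `tau` of repeats."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    counts = np.zeros(X.shape[1])
    splitter = StratifiedShuffleSplit(n_splits=n_repeats,
                                      train_size=subsample, random_state=seed)
    for idx, _ in splitter.split(X, y):
        # Standardize each subsample and fit an elastic-net regularized model.
        Xs = StandardScaler().fit_transform(X[idx])
        model = LogisticRegression(penalty="elasticnet", solver="saga",
                                   C=C, l1_ratio=l1_ratio, max_iter=5000)
        model.fit(Xs, y[idx])
        # A feature counts as selected if any of its coefficients survived
        # the regularization in this repeat.
        counts += (np.abs(model.coef_).max(axis=0) > 1e-8)
    return np.where(counts / n_repeats >= tau)[0]
```

    Aggregating selection frequencies across repeats is what stabilizes the selected feature set against small changes in the data, which is the property the thesis identifies as a prerequisite for interpretability.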

    Relief-based Feature Selection Algorithms

    This master thesis describes how relief algorithms work and how to use them to extract relevant attributes (feature selection) when preprocessing data. We built an experimental application with several implementations of relief algorithms to verify their functionality and to carry out experiments, with visual output in the form of clustering of the processed data. In conclusion, we describe practical examples and uses of relief algorithms in data preprocessing.
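
    For reference, a didactic sketch of the original Relief weighting scheme (binary classes, numeric features) is given below; ReliefF-style extensions such as k nearest neighbours, multi-class handling and missing values, which an application like the one described would also cover, are omitted here.

```python
# Didactic sketch of the original Relief weighting scheme for binary
# classification with numeric features; ReliefF-style extensions (k nearest
# neighbours, multi-class handling, missing values) are omitted.
import numpy as np

def relief(X, y, n_iterations=100, random_state=0):
    rng = np.random.default_rng(random_state)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    # Scale feature differences by each feature's value range so that the
    # resulting weights are comparable across features.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    weights = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        i = rng.integers(len(X))
        same = np.where(y == y[i])[0]
        same = same[same != i]
        other = np.where(y != y[i])[0]
        dist = lambda idx: np.abs(X[idx] - X[i]).sum(axis=1)
        hit = same[np.argmin(dist(same))]     # nearest neighbour with the same label
        miss = other[np.argmin(dist(other))]  # nearest neighbour with a different label
        # Reward features that differ on the miss and agree on the hit.
        weights += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return weights / n_iterations
```

    Features with the highest weights are those that separate the classes locally, so thresholding or ranking these weights yields the relevant attribute subset used for preprocessing.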

    Immersive analytics for oncology patient cohorts

    This thesis proposes a novel interactive immersive analytics tool and methods to interrogate a cancer patient cohort in an immersive virtual environment, namely Virtual Reality to Observe Oncology data Models (VROOM). The overall objective is to develop an immersive analytics platform, which includes a data analytics pipeline from raw gene expression data to immersive visualisation on virtual and augmented reality platforms utilising a game engine. Unity3D has been used to implement the visualisation. Work in this thesis could provide oncologists and clinicians with an interactive visualisation and visual analytics platform that helps them drive their analysis of treatment efficacy and achieve the goal of evidence-based personalised medicine. The thesis integrates the latest discoveries and developments in cancer patient prognosis, immersive technologies, machine learning, decision support systems and interactive visualisation to form an immersive analytics platform for complex genomic data. The experimental paradigm followed in this thesis is understanding transcriptomics in cancer samples. It specifically investigates gene expression data to determine the biological similarity revealed by the transcriptomic profiles of patients' tumour samples, which indicate the genes active in different patients. In summary, the thesis contributes: i) a novel immersive analytics platform for patient cohort data interrogation in a similarity space based on the patients' biological and genomic similarity; ii) an effective immersive environment optimisation design based on a usability study of exocentric and egocentric visualisation, audio and sound design optimisation; iii) an integration of trusted and familiar 2D biomedical visual analytics methods into the immersive environment; iv) a novel use of game theory as the decision-making engine to support the analytics process, and an application of optimal transport theory to missing-data imputation to ensure preservation of the data distribution; and v) case studies showcasing the real-world application of the visualisation and its effectiveness.

    Radiogenomics Framework for Associating Medical Image Features with Tumour Genetic Characteristics

    Significant progress has been made in the understanding of human cancers at the molecular genetics level, and it is providing new insights into their underlying pathophysiology. This progress has enabled the subclassification of the disease and the development of targeted therapies that address specific biological pathways. However, obtaining genetic information remains invasive and costly. Medical imaging is a non-invasive technique that captures important visual characteristics (i.e. image features) of abnormalities and plays an important role in routine clinical practice. Advancements in computerised medical image analysis have enabled quantitative approaches to extract image features that can reflect tumour genetic characteristics, leading to the emergence of ‘radiogenomics’. Radiogenomics investigates the relationships between medical imaging features and tumour molecular characteristics, and enables the derivation of imaging surrogates (radiogenomics features) for genetic biomarkers that can provide alternative approaches to non-invasive and accurate cancer diagnosis. This thesis presents a new framework that combines several novel methods for radiogenomics analysis, associating medical image features with tumour genetic characteristics, with three main objectives: i) a comprehensive characterisation of tumour image features that reflect underlying genetic information; ii) a method that identifies radiogenomics features encoding common pathophysiological information across different diseases, overcoming the dependence on large annotated datasets; and iii) a method that quantifies radiogenomics features from multi-modal imaging data and accounts for the unique information encoded in tumour heterogeneity sub-regions. The presented methods advance radiogenomics analysis and contribute to improving research in computerised medical image analysis.

    Joint learning from multiple information sources for biological problems

    Thanks to technological advancements, more and more biological data have been generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules and entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically downgrade the performance of a predictive model on unseen data and, thus, limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through the knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports our learning to speak correct and meaningful paragraphs, and so on. Generally, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models' knowledge base with publicly available related data to enhance the computational models' prediction performance. Our work shares commonality with existing work in multimodal learning, multi-task learning, and transfer learning, although there are certain differences in some cases. Besides the proposed architectures, we present large-scale experimental setups with consensus evaluation metrics, along with the creation and release of large datasets to showcase our approaches' superiority. Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions, to demonstrate the systems' utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the models' prediction results, facilitating field experts' assessment and adaptation. We believe that our work serves as one of the first steps in bridging the gap between "Computer Science" and "Biology", opening a new era of fruitful collaboration between computer scientists and biological field experts.

    Phenotyping the histopathological subtypes of non-small-cell lung carcinoma: how beneficial is radiomics?

    The aim of this study was to investigate the usefulness of radiomics in the absence of well-defined standard guidelines. Specifically, we extracted radiomics features from multicenter computed tomography (CT) images to differentiate between the four histopathological subtypes of non-small-cell lung carcinoma (NSCLC), and we compared how the results varied with the radiomics model. We investigated the presence of batch effects and the impact of feature harmonization on the models' performance, as well as how the composition of the training dataset influenced the selected feature subsets and, consequently, the models' performance. By combining data from two publicly available datasets, the study includes a total of 152 squamous cell carcinoma (SCC), 106 large cell carcinoma (LCC), 150 adenocarcinoma (ADC), and 58 not otherwise specified (NOS) cases. Using the matRadiomics tool, an example of Image Biomarker Standardization Initiative (IBSI) compliant software, 1781 radiomics features were extracted from each malignant lesion identified in the CT images. After batch analysis and feature harmonization, based on the ComBat tool integrated in matRadiomics, the harmonized and non-harmonized datasets were given as input to a machine learning modeling pipeline with the following steps: (i) training-set/test-set splitting (80/20); (ii) Kruskal-Wallis analysis and LASSO linear regression for feature selection; (iii) model training; (iv) model validation and hyperparameter optimization; and (v) model testing. Model optimization consisted of a 5-fold cross-validated Bayesian optimization repeated ten times (inner loop), and the whole pipeline was repeated 10 times (outer loop) with six different machine learning classification algorithms. The stability of the feature selection was also evaluated. Results showed that batch effects were present even when the voxels were resampled to an isotropic form, and that feature harmonization correctly removed them, although the models' performance decreased. A low accuracy (61.41%) was reached when differentiating between the four subtypes, even though a high average area under the curve (AUC) of 0.831 was obtained; the NOS subtype was classified almost completely correctly (true positive rate of about 90%). The accuracy increased to 77.25%, with a high AUC (0.821), when only the SCC and ADC subtypes were considered, although harmonization decreased the accuracy to 58%. The features that contributed most to the models' performance were those extracted from wavelet-decomposed and Laplacian of Gaussian (LoG) filtered images, and they belonged to the texture feature class. In conclusion, we showed that our multicenter data were affected by batch effects, that these could significantly alter the models' performance, and that feature harmonization correctly removed them. Although wavelet features seemed to be the most informative, an absolute subset could not be identified since it changed depending on the training/testing splitting. Moreover, performance was influenced by the chosen dataset and by the machine learning methods, which could reach high accuracy in binary classification tasks but could underperform in multiclass problems. It is, therefore, essential that the scientific community propose a more systematic radiomics approach, focusing on multicenter studies, with clear and solid guidelines to facilitate the translation of radiomics to clinical practice.
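
    A rough sketch of such a nested evaluation loop is shown below: an outer loop over random 80/20 splits, Kruskal-Wallis screening plus LASSO for feature selection, and a 5-fold cross-validated Bayesian hyperparameter search, here approximated with scikit-optimize's BayesSearchCV. The classifier, search space and thresholds are placeholders rather than the six algorithms and settings used in the study, and ComBat harmonization is assumed to have been applied beforehand.

```python
# Rough sketch of the nested evaluation loop described above; all specific
# choices (classifier, search space, thresholds) are placeholders.
import numpy as np
from scipy.stats import kruskal
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from skopt import BayesSearchCV  # requires scikit-optimize

def select_features(X, y, p_threshold=0.05, lasso_alpha=0.01):
    # Kruskal-Wallis screening: keep features whose distributions differ
    # across the (numerically encoded) subtypes.
    keep = [j for j in range(X.shape[1])
            if kruskal(*[X[y == c, j] for c in np.unique(y)]).pvalue < p_threshold]
    # LASSO shrinks the screened set to a sparse subset.
    lasso = Lasso(alpha=lasso_alpha).fit(X[:, keep], y)
    return [keep[j] for j in np.flatnonzero(lasso.coef_)]

def run_once(X, y, seed):
    # (i) 80/20 training/test split, stratified by subtype.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed)
    cols = select_features(X_tr, y_tr)                     # (ii) feature selection
    # (iii)-(iv) training with a 5-fold cross-validated Bayesian search.
    search = BayesSearchCV(RandomForestClassifier(random_state=seed),
                           {"n_estimators": (50, 500), "max_depth": (2, 20)},
                           n_iter=25, cv=5, random_state=seed)
    search.fit(X_tr[:, cols], y_tr)
    return search.score(X_te[:, cols], y_te)               # (v) held-out testing

# Outer loop: repeat the whole pipeline on different splits.
# scores = [run_once(X, y, seed) for seed in range(10)]
```

    Because feature selection is refit inside every outer repeat, the selected subsets can differ between splits, which is exactly the stability issue the study reports for the wavelet and LoG features.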

    Modelling, Monitoring, Control and Optimization for Complex Industrial Processes

    This reprint includes 22 research papers and an editorial, collected from the Special Issue "Modelling, Monitoring, Control and Optimization for Complex Industrial Processes", highlighting recent research advances and emerging research directions in complex industrial processes. The reprint aims to promote the research field and to benefit readers from both the academic community and industrial sectors.

    Developing Novel Computer Aided Diagnosis Schemes for Improved Classification of Mammography Detected Masses

    Mammography imaging is a population-based breast cancer screening tool that has greatly aided the decrease in breast cancer mortality over time. Although mammography is the most frequently employed breast imaging modality, its performance is often unsatisfactory, with low sensitivity and high false positive rates. This is because reading and interpreting mammography images remains difficult due to the heterogeneity of breast tumors and dense overlapping fibroglandular tissue. To help overcome these clinical challenges, researchers have made great efforts to develop computer-aided detection and/or diagnosis (CAD) schemes to provide radiologists with decision-making support tools. In this dissertation, I investigate several novel methods for improving the performance of a CAD system in distinguishing between malignant and benign masses. In the first study, we test the hypothesis that handcrafted radiomics features and deep learning features contain complementary information, so that the fusion of these two types of features will increase the feature representation of each mass and improve the performance of the CAD system in distinguishing malignant and benign masses. Regions of interest (ROI) surrounding suspicious masses are extracted and two types of features are computed. The first set consists of 40 radiomics features and the second set includes deep learning (DL) features computed from a pretrained VGG16 network. DL features are extracted from two pseudo-color image sets, producing a total of three feature vectors after feature extraction, namely: handcrafted, DL-stacked, and DL-pseudo. Linear support vector machines (SVM) are trained using each feature set alone and in combinations. Results show that the fusion CAD system significantly outperforms the systems using either feature type alone (AUC=0.756±0.042, p<0.05). This study demonstrates that handcrafted and DL features contain useful complementary information and that fusing these two types of features increases the CAD classification performance. In the second study, we expand upon the first study and develop a novel CAD framework that fuses information extracted from ipsilateral views of bilateral mammograms using both DL and radiomics feature extraction methods. Each case in this study is represented by four images: the craniocaudal (CC) and mediolateral oblique (MLO) views of the left and right breast. First, we extract matching ROIs from each of the four views using an ipsilateral matching and bilateral registration scheme to ensure masses are appropriately matched. Next, the handcrafted radiomics features and VGG16 model-generated features are extracted from each ROI, resulting in eight feature vectors. Then, after reducing feature dimensionality and quantifying the bilateral asymmetry, we test four fusion methods. Results show that multi-view CAD systems significantly outperform single-view systems (AUC = 0.876±0.031 vs AUC = 0.817±0.026 for the CC view and 0.792±0.026 for the MLO view, p<0.001). The study demonstrates that the shift from single-view CAD to four-view CAD and the inclusion of both deep transfer learning and radiomics features increase the feature representation of the mass and thus improve CAD performance in distinguishing between malignant and benign breast lesions. In the third study, we build upon the first and second studies and investigate the effects of pseudo-color image generation on classifying suspicious mammography-detected breast lesions as malignant or benign using deep transfer learning in a multi-view CAD scheme. Seven pseudo-color image sets are created through combinations of the original grayscale image, a histogram-equalized image, a bilaterally filtered image, and a segmented mass image. Using the multi-view CAD framework developed in the previous study, we observe that the two pseudo-color sets created using a segmented mass in one of the three image channels perform significantly better than all other pseudo-color sets (AUC=0.882 and AUC=0.889, p<0.05 for all comparisons). These results support our hypothesis that pseudo-color images generated with a segmented mass optimize the mammogram image feature representation by providing additional complementary information to the CADx scheme, which improves performance in classifying suspicious mammography-detected breast lesions as malignant or benign. In summary, each of the studies presented in this dissertation aims to increase the accuracy of a CAD system in classifying suspicious mammography-detected masses, and each takes a novel approach to increasing the feature representation of the mass to be classified. The results of each study demonstrate the potential utility of these CAD schemes as an aid to radiologists in the clinical workflow.
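
    To make the fusion step of the first study concrete, the sketch below concatenates ImageNet-pretrained VGG16 features with handcrafted radiomics features and trains a linear SVM. ROI extraction, pseudo-color construction and the specific 40 radiomics features are not reproduced, so the arrays `rois` and `handcrafted` are placeholders under stated assumptions rather than the dissertation's actual pipeline.

```python
# Illustrative sketch of the feature-fusion idea from the first study: deep
# features from an ImageNet-pretrained VGG16 are concatenated with handcrafted
# radiomics features and fed to a linear SVM. `rois` and `handcrafted` are
# placeholder arrays standing in for the dissertation's preprocessed data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input

def deep_features(rois):
    """rois: array of shape (n, 224, 224, 3) holding pseudo-color ROI images."""
    backbone = VGG16(weights="imagenet", include_top=False, pooling="avg")
    return backbone.predict(preprocess_input(rois.astype("float32")))

def fused_classifier(rois, handcrafted, labels):
    """handcrafted: (n, 40) radiomics features; labels: 0 = benign, 1 = malignant."""
    fused = np.hstack([handcrafted, deep_features(rois)])  # concatenate feature types
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    return clf.fit(fused, labels)
```

    The same fused-feature idea extends to the multi-view setting of the second study by building one such feature vector per view and combining them before or after classification.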

    Machine Learning Methods with Noisy, Incomplete or Small Datasets

    In many machine learning applications, the available datasets are sometimes incomplete, noisy, or affected by artifacts. In supervised scenarios, label information may be of low quality, which can include unbalanced training sets, noisy labels, and other problems. Moreover, in practice, it is very common that the available data samples are not enough to derive useful supervised or unsupervised classifiers. All these issues are commonly referred to as the low-quality data problem. This book collects novel contributions on machine learning methods for low-quality datasets, to contribute to the dissemination of new ideas for solving this challenging problem and to provide clear examples of application in real scenarios.