
    Auto-Sklearn 2.0: Hands-free AutoML via Meta-Learning

    Automated Machine Learning (AutoML) supports practitioners and researchers with the tedious task of designing machine learning pipelines and has recently achieved substantial success. In this paper, we introduce new AutoML approaches motivated by our winning submission to the second ChaLearn AutoML challenge. We develop PoSH Auto-sklearn, which enables AutoML systems to work well on large datasets under rigid time limits by using a new, simple and meta-feature-free meta-learning technique and by employing a successful bandit strategy for budget allocation. However, PoSH Auto-sklearn introduces even more ways of running AutoML and might make it harder for users to set it up correctly. Therefore, we also go one step further and study the design space of AutoML itself, proposing a solution towards truly hands-free AutoML. Together, these changes give rise to the next generation of our AutoML system, Auto-sklearn 2.0. We verify the improvements achieved by these additions in an extensive experimental study on 39 AutoML benchmark datasets. We conclude the paper by comparing against other popular AutoML frameworks and Auto-sklearn 1.0, reducing the relative error by up to a factor of 4.5 and achieving performance in 10 minutes that is substantially better than what Auto-sklearn 1.0 achieves within an hour.
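The bandit strategy for budget allocation described above is successive halving: give every candidate pipeline a small budget, keep the better fraction, multiply the budget, and repeat. The following is a minimal sketch of that general idea only, with an invented toy evaluation; it is not the PoSH Auto-sklearn implementation.

```python
import random

def successive_halving(candidates, evaluate, min_budget=1, eta=2, max_budget=8):
    """Give every candidate a small budget, keep the best 1/eta fraction,
    multiply the budget by eta, and repeat until one candidate remains."""
    budget = min_budget
    pool = list(candidates)
    while len(pool) > 1 and budget <= max_budget:
        scores = {c: evaluate(c, budget) for c in pool}
        pool.sort(key=lambda c: scores[c], reverse=True)  # higher is better
        pool = pool[: max(1, len(pool) // eta)]
        budget *= eta
    return pool[0]

# Toy demo: eight hypothetical pipelines with a fixed true quality; the
# evaluation is noisier at small budgets, as a cheap proxy evaluation would be.
random.seed(0)
true_quality = {f"pipeline_{i}": i / 10 for i in range(8)}

def noisy_eval(name, budget):
    return true_quality[name] + random.gauss(0, 0.2 / budget)

best = successive_halving(true_quality, noisy_eval)
```

Weak candidates are thus eliminated cheaply, while promising ones earn progressively larger training budgets.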

    Heterogeneous ensembles and time series classification techniques for the non-invasive authentication of spirits

    Spirits are a prime target for fraudulent activity. Particular brands, production processes, and other factors such as age can carry high value, and leave space for mimicry. Further, the improper production of spirits, either maliciously or through negligence, can result in harmful substances being sold for consumption. Lastly, genuine spirits producers themselves must ensure the quality and standardisation of their products before sale. Authenticating spirits can be a time-consuming and destructive process, requiring sealed bottles to be opened for access to the product. It is therefore desirable to have a fast, non-invasive means of indicating the authenticity, safety, and correctness of spirits. We advance and prototype such a system based on near-infrared spectroscopy, and generate datasets for the detection of correct alcohol concentrations in synthesised spirits, for the presence of methanol in genuine spirits, and for the distinction of particular genuine products in a given bottle. The standard chemometric pipelines for the analysis of spectra involve smoothing of the signal, standardising for global intensity, possible dimensionality reduction, and some form of least squares regression. This approach has decades of proof behind it, and works under the assumptions of clean signal gathering, potentially the separation of the sample and the particular substance of interest, and a generally linear relationship between the light received or blocked and the analyte’s contents. In the proposed system, at least one of these assumptions must be violated. We therefore investigate the use of modern classification techniques to overcome these challenges. In particular, we investigate and develop ensemble methods and time series classification algorithms.
Our first hypothesis is that algorithms which consider the ordered nature of the wavelength features, as opposed to treating the spectra effectively as tabular data, can better handle the structural changes brought about by different bottle and environmental characteristics. The second is that ensembling heterogeneous classifiers is the best initial technique for a new data science problem, and should be particularly helpful for the spirit authentication problem, where different classifiers may be able to correct for different defects in the data. In initial investigations on datasets of synthesised alcohol solutions and different products, we prove the feasibility of the authentication system for making at least indicative predictions of authenticity, but find that it lacks the precision and accuracy needed for anything more than indicative results. Following this, we propose a novel heterogeneous ensembling scheme, CAWPE, and perform a large-scale evaluation on public archives to prove its efficacy. We then outline improvements in the time series classification space that lead to the state-of-the-art meta-ensemble HIVE-COTE 2.0, which makes use of CAWPE. We lastly apply the developed techniques to a final dataset on methanol concentration detection. We find that the proposed system can classify the methanol concentration of arbitrary spirits and bottles, from ten possible values containing as little as 0.25% methanol, to an accuracy of 0.921. We further conclude that while heterogeneously ensembling tabular classifiers does improve the authentication of spirits from spectra, time series classification methods confer no particular advantage beyond tabular methods.
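CAWPE (Cross-validation Accuracy Weighted Probabilistic Ensemble) combines members by weighting each classifier's class-probability estimates by its estimated accuracy raised to a power. The sketch below shows only that combination step, with invented numbers; it is not the thesis's implementation, and the exponent of 4 is the default reported in the CAWPE literature.

```python
import numpy as np

def cawpe_combine(prob_estimates, cv_accuracies, alpha=4.0):
    """Weight each classifier's class-probability estimates by its estimated
    (cross-validation) accuracy raised to the power alpha, then sum and
    renormalise into a single distribution."""
    weights = np.asarray(cv_accuracies, dtype=float) ** alpha
    probs = np.asarray(prob_estimates, dtype=float)  # (n_classifiers, n_classes)
    combined = (weights[:, None] * probs).sum(axis=0)
    return combined / combined.sum()

# Three hypothetical members voting over two classes: the 0.9-accuracy
# member outweighs the two weaker ones that lean the other way.
p = cawpe_combine(
    prob_estimates=[[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]],
    cv_accuracies=[0.9, 0.6, 0.55],
)
```

Raising the accuracy to a power exaggerates small differences between members, so stronger classifiers dominate the vote without the weaker ones being discarded entirely.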

    Automatic machine learning: methods, systems, challenges

    This open access book presents the first comprehensive overview of general methods in Automatic Machine Learning (AutoML), collects descriptions of existing systems based on these methods, and discusses the first international challenge of AutoML systems. The book serves as a point of entry into this quickly-developing field for researchers and advanced students alike, as well as providing a reference for practitioners aiming to use AutoML in their work. The recent success of commercial ML applications and the rapid growth of the field have created a high demand for off-the-shelf ML methods that can be used easily and without expert knowledge. Many of the recent machine learning successes crucially rely on human experts, who select appropriate ML architectures (deep learning architectures or more traditional ML workflows) and their hyperparameters; however, the field of AutoML targets a progressive automation of machine learning, based on principles from optimization and machine learning itself.
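The hyperparameter selection that AutoML automates can be pictured with a toy example (a self-contained sketch on a synthetic ridge-regression task; it is not an excerpt from any system described in the book): sample configurations at random and keep the one with the lowest validation error.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression task, split into training and validation halves.
X = rng.normal(size=(200, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + rng.normal(scale=0.5, size=200)
Xtr, Xva, ytr, yva = X[:150], X[150:], y[:150], y[150:]

def fit_ridge(X, y, alpha):
    """Closed-form ridge regression weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def val_error(alpha):
    """Mean squared error of the ridge fit on the held-out validation split."""
    w = fit_ridge(Xtr, ytr, alpha)
    return float(np.mean((Xva @ w - yva) ** 2))

# Random search: sample regularisation strengths log-uniformly and keep
# the one with the lowest validation error.
candidates = 10.0 ** rng.uniform(-3, 3, size=20)
best_alpha = min(candidates, key=val_error)
```

Real AutoML systems replace the single hyperparameter with full pipeline configurations and the random sampling with model-based optimization, but the select-by-validation-error loop is the same.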

    Shapelet Transforms for Univariate and Multivariate Time Series Classification

    Time Series Classification (TSC) is a growing field of machine learning research. One particular algorithm from the TSC literature is the Shapelet Transform (ST). Shapelets are phase-independent subsequences extracted from time series to form discriminatory features. It has been shown that using shapelets to transform the datasets into a new space can improve performance. One of the major problems with ST is that the algorithm is O(n²m⁴), where n is the number of time series and m is the length of the series. As a problem increases in size, or additional dimensions are added, the algorithm quickly becomes computationally infeasible. The research question addressed is whether the shapelet transform can be improved in terms of accuracy and speed. Making algorithmic improvements to shapelets will enable the development of multivariate shapelet algorithms that can attempt to solve much larger problems in realistic time frames. In support of this thesis, a new distance early abandon method is proposed. A class balancing algorithm is implemented, which uses a one-vs-all multi-class information gain that enables heuristics originally developed for two-class problems. To support these improvements, a large-scale analysis of the best shapelet algorithms is conducted as part of a larger experimental evaluation. ST is shown to be one of the most accurate algorithms in TSC on the UCR-UEA datasets. Contract classification is proposed for shapelets, where a fixed run time is set and the number of shapelets is bounded. Four search algorithms are evaluated with fixed run times of one hour and one day; three are not significantly worse than a full enumeration. Finally, three multivariate shapelet algorithms are developed and compared to benchmark results and to multivariate dynamic time warping.
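The distance early abandon idea can be illustrated as follows (a minimal sketch of the general technique, not the method proposed in the thesis): while scanning each candidate window of a series, the running sum of squared differences is compared against the best distance found so far, and the window is discarded as soon as the sum exceeds it.

```python
import numpy as np

def shapelet_distance(series, shapelet, best_so_far=np.inf):
    """Minimum squared Euclidean distance between `shapelet` and any
    equal-length window of `series`. A window is abandoned as soon as its
    running sum exceeds the best distance seen so far; `best_so_far` lets a
    caller also abandon across whole series when ranking many candidates."""
    m = len(shapelet)
    best = best_so_far
    for start in range(len(series) - m + 1):
        acc = 0.0
        for i in range(m):
            acc += (series[start + i] - shapelet[i]) ** 2
            if acc >= best:      # early abandon: this window cannot improve
                break
        else:                    # window summed fully and beat the best
            best = acc
    return best

s = np.array([0.0, 0.1, 1.0, 2.0, 1.0, 0.1, 0.0])
shp = np.array([1.0, 2.0, 1.0])
d = shapelet_distance(s, shp)    # exact match at offset 2, so d == 0.0
```

The abandon check changes no results, only the amount of work: in the worst case the cost is unchanged, but in practice most windows are rejected after a few terms.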

    Multiparametric Magnetic Resonance Imaging Artificial Intelligence Pipeline for Oropharyngeal Cancer Radiotherapy Treatment Guidance

    Oropharyngeal cancer (OPC) is a widespread disease and one of the few domestic cancers that is rising in incidence. Radiographic images are crucial for assessment of OPC and aid in radiotherapy (RT) treatment. However, RT planning with conventional imaging approaches requires operator-dependent tumor segmentation, which is the primary source of treatment error. Further, OPC expresses differential tumor/node mid-RT response (rapid response) rates, resulting in significant differences between planned and delivered RT dose. Finally, clinical outcomes for OPC patients can also be variable, which warrants the investigation of prognostic models. Multiparametric MRI (mpMRI) techniques that incorporate simultaneous anatomical and functional information coupled to artificial intelligence (AI) approaches could improve clinical decision support for OPC by providing immediately actionable clinical rationale for adaptive RT planning. If tumors could be reproducibly segmented, rapid response could be classified, and prognosis could be reliably determined, overall patient outcomes would be optimized to improve the therapeutic index as a function of more risk-adapted RT volumes. Consequently, there is an unmet need for automated and reproducible imaging which can simultaneously segment tumors and provide predictive value for actionable RT adaptation. This dissertation primarily seeks to explore and optimize image processing, tumor segmentation, and patient outcomes in OPC through a combination of advanced imaging techniques and AI algorithms. In the first specific aim of this dissertation, we develop and evaluate mpMRI pre-processing techniques for use in downstream segmentation, response prediction, and outcome prediction pipelines. Various MRI intensity standardization and registration approaches were systematically compared and benchmarked. Moreover, synthetic image algorithms were developed to decrease MRI scan time in an effort to optimize our AI pipelines. 
We demonstrated that proper intensity standardization and image registration can improve mpMRI quality for use in AI algorithms, and developed a novel method to decrease mpMRI acquisition time. Subsequently, in the second specific aim of this dissertation, we investigated underlying questions regarding the implementation of RT-related auto-segmentation. Firstly, we quantified interobserver variability for an unprecedentedly large number of observers for various radiotherapy structures in several disease sites (with a particular emphasis on OPC) using a novel crowdsourcing platform. We then trained an AI algorithm on a series of extant matched mpMRI datasets to segment OPC primary tumors. Moreover, we validated and compared our best model's performance to clinical expert observers. We demonstrated that AI-based mpMRI OPC tumor auto-segmentation offers decreased variability and comparable accuracy relative to clinical experts, and that certain mpMRI input channel combinations could further improve performance. Finally, in the third specific aim of this dissertation, we predicted OPC primary tumor mid-therapy (rapid) treatment response and prognostic outcomes. Using co-registered pre-therapy and mid-therapy primary tumor manual segmentations of OPC patients, we generated and characterized treatment-sensitive and treatment-resistant pre-RT sub-volumes. These sub-volumes were used to train an AI algorithm to predict individual voxel-wise treatment resistance. Additionally, we developed an AI algorithm to predict OPC patient progression-free survival using pre-therapy imaging from an international data science competition (ranking 1st place), and then translated these approaches to mpMRI data. We demonstrated that AI models could be used to predict rapid response and prognostic outcomes using pre-therapy imaging, which could help guide treatment adaptation, though further work is needed.
In summary, the completion of these aims facilitates the development of an image-guided, fully automated OPC clinical decision support tool. The resultant deliverables from this project will positively impact patients by enabling optimized therapeutic interventions in OPC. Future work should consider investigating additional imaging timepoints, imaging modalities, uncertainty quantification, perceptual and ethical considerations, and prospective studies for eventual clinical implementation. A dynamic version of this dissertation is publicly available and assigned a digital object identifier through Figshare (doi: 10.6084/m9.figshare.22141871).
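Agreement between observers' segmentations, and between an auto-segmentation and an expert contour, is commonly quantified with the Dice similarity coefficient. The dissertation abstract does not name its metric, so the following is purely illustrative.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary segmentation masks:
    2|A∩B| / (|A| + |B|); 1.0 for identical masks, 0.0 for disjoint ones."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # both empty: define as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Two observers' masks over a 4x4 region, differing by one voxel.
obs1 = np.zeros((4, 4), dtype=bool)
obs1[1:3, 1:3] = True            # 4 voxels
obs2 = obs1.copy()
obs2[3, 3] = True                # 5 voxels, 4 shared
# dice(obs1, obs2) = 2*4 / (4+5) = 8/9 ≈ 0.889
```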

    EDM 2011: 4th international conference on educational data mining: Eindhoven, July 6-8, 2011: proceedings


    Mining time-series data using discriminative subsequences

    Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a comprehensible means of understanding both data and models. For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers. We use two methods to increase interpretability: first, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length, reducing dimensionality and removing duplicate shapelets; second, we transform the shapelet data into binary data reflecting the presence or absence of particular shapelets, a representation that is straightforward to interpret and understand. We supplement the ensemble classifier with partial classification. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model. Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically and in an unsupervised manner using a local-shape-based approach.
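The binary presence/absence representation can be pictured with a small sketch: given each series' best-match distance to each shapelet, a shapelet counts as "present" when that distance falls below a per-shapelet threshold. The thresholds and numbers below are hypothetical; the thesis's procedure for choosing thresholds is not reproduced here.

```python
import numpy as np

def binarise_shapelet_features(distances, thresholds):
    """Turn a (n_series, n_shapelets) matrix of shapelet distances into
    binary presence indicators: a shapelet is 'present' in a series when
    its best-match distance falls below that shapelet's threshold."""
    D = np.asarray(distances, dtype=float)
    t = np.asarray(thresholds, dtype=float)
    return (D < t).astype(int)

# Three series, two shapelets, hypothetical per-shapelet thresholds.
D = [[0.1, 2.0],
     [0.9, 0.2],
     [3.0, 0.4]]
B = binarise_shapelet_features(D, thresholds=[0.5, 0.5])
# B -> [[1, 0], [0, 1], [0, 1]]
```

The resulting 0/1 matrix is what makes rule-set learning (as described above) straightforward: each rule condition simply asks whether a given shapelet occurs in the series.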

    Contributions of biomechanical modeling and machine learning to the automatic registration of Multiparametric Magnetic Resonance and Transrectal Echography for prostate brachytherapy

    Prostate cancer (PCa) is the most common malignancy in western males, and the third by mortality. After elevated Prostate Specific Antigen (PSA) blood levels are detected or after a suspicious rectal examination, a Magnetic Resonance (MR) image of the prostate is acquired and assessed by radiologists to locate suspicious regions. These are then biopsied, i.e. living tissue samples are collected and analyzed histopathologically to confirm the presence of cancer and establish its degree of aggressiveness. During the biopsy procedure, Ultrasound (US) is typically used for guidance and lesion localization. However, lesions are not directly visible in US, and the urologist needs fusion software that performs MR-US registration, so that the MR-marked locations can be transferred to the US image. This is essential to ensure that the collected samples truly come from the suspicious area. This work compiles five publications employing several Artificial Intelligence (AI) algorithms to analyze prostate images (MR and US) and thereby improve the efficiency and accuracy of the diagnosis, biopsy and treatment of PCa: 1. Automatic prostate segmentation in MR and US: Prostate segmentation consists in delimiting or marking the prostate in a medical image, separating it from the rest of the organs or structures. Fully automating this task, which is required for any subsequent analysis, saves significant time for radiologists and urologists, while also improving accuracy and repeatability. 2. Segmentation resolution enhancement: A methodology for improving the resolution of the previously obtained segmentations is presented. 3. Automatic detection and classification of MR lesions: An AI model is trained to detect lesions as a radiologist would and to estimate their risk. The model achieves improved diagnostic accuracy, resulting in a fully automatic system that could be deployed as a second clinical opinion or as a criterion for patient prioritization. 4. Simulation of biomechanical behavior in real time: It is proposed to accelerate the simulation of the biomechanical behavior of soft organs using AI. 5. Automatic MR-US registration: Registration allows localization of MR-marked lesions on US. High accuracy in this task is essential for the correctness of the biopsy and/or focal treatment procedures (such as high-rate brachytherapy). Here, AI is used to solve the registration problem in near-real time, while exploiting underlying biomechanically-compatible models.
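The registration problem at the core of contribution 5 can be pictured, in a drastically simplified form, as optimising a spatial transformation that best aligns two images. The sketch below recovers a pure translation between two synthetic images by minimising mean squared intensity difference; the real MR-US problem is multimodal and deformable, and the thesis's biomechanical/AI approach is not reproduced here.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from scipy.optimize import minimize

def mse_at(params, fixed, moving):
    """Mean squared error after translating `moving` by (dy, dx)."""
    moved = nd_shift(moving, shift=params, order=1, mode="nearest")
    return float(np.mean((fixed - moved) ** 2))

# Synthetic 'fixed' image with a bright rectangle, and a 'moving' image
# that is the same rectangle displaced by a known offset.
fixed = np.zeros((64, 64))
fixed[20:40, 25:45] = 1.0
moving = nd_shift(fixed, shift=(3.0, -2.0), order=1)

result = minimize(mse_at, x0=[0.0, 0.0], args=(fixed, moving),
                  method="Powell")
# result.x should recover approximately (-3, 2), undoing the displacement
```

Replacing the two-parameter translation with a dense deformation field, constrained by a biomechanical model of the organ, is what makes the clinical problem far harder than this toy.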

    Deep Learning-Based Particle Detection and Instance Segmentation for Microscopy Images

    Microscopy imaging techniques allow researchers to gain insights into complex, previously not understood processes. To ease researchers' path to new findings, highly automated, versatile, accurate, user-friendly, and reliable methods for particle detection and instance segmentation are required. In particular, these methods should be suitable for different imaging conditions and applications without requiring expert knowledge for adaptation. Therefore, this thesis presents a new deep learning-based method for particle detection and two deep learning-based methods for instance segmentation. The particle detection approach uses a particle-size-dependent upsampling step and a U-Net for the semantic segmentation of particle markers. After validating the upsampling with synthetically generated data, the particle detection software BeadNet is presented. Results on a dataset of fluorescent latex beads show that BeadNet can detect particles more accurately than traditional methods. The two new instance segmentation methods use a U-Net with two decoders and are evaluated on four object types and three microscopy imaging modalities. A single, unbalanced training dataset and a single set of post-processing parameters are used for the evaluation. The better of the two methods is then further validated in the Cell Tracking Challenge, achieving several top-3 rankings and, for six datasets, performance comparable to a human expert. In addition, the new instance segmentation software microbeSEG is presented. Like BeadNet, microbeSEG uses OMERO for data management and offers functionality for creating training data, training models, and evaluating and applying them.
The qualitative applications of BeadNet and microbeSEG show that both tools enable the accurate analysis of many different kinds of microscopy image data. Finally, this dissertation gives an outlook on the need for further guidelines for image analysis competitions and method comparisons, in support of targeted future method development.