323 research outputs found

    Online Optimization Methods for the Quantification Problem

    The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis, epidemiology, etc. For example, in sentiment analysis, the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing the above task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. Contemporary literature cites several performance measures used to measure the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures that seek to balance quantification and classification performance. Our algorithms present a significant advancement in the theory of multivariate optimization and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures. Comment: 26 pages, 6 figures. A short version of this manuscript will appear in the proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 201
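To make the notion of class prevalence concrete, the following is a minimal sketch of two standard baseline quantifiers, "classify and count" and its adjusted variant. It is not the online algorithm proposed in the paper; the function names and the assumption of binary labels (1 = positive) are illustrative only.

```python
from typing import Sequence


def classify_and_count(predictions: Sequence[int]) -> float:
    """Estimate positive-class prevalence as the fraction of items predicted 1."""
    if len(predictions) == 0:
        raise ValueError("predictions must be non-empty")
    return sum(1 for p in predictions if p == 1) / len(predictions)


def adjusted_count(raw_prevalence: float, tpr: float, fpr: float) -> float:
    """Correct a raw estimate using the classifier's true and false positive
    rates (assumed to be measured on held-out labeled data)."""
    if tpr <= fpr:
        raise ValueError("tpr must exceed fpr for the correction to be defined")
    adjusted = (raw_prevalence - fpr) / (tpr - fpr)
    # Clip to [0, 1] because the correction can overshoot on small samples.
    return min(1.0, max(0.0, adjusted))


if __name__ == "__main__":
    raw = classify_and_count([1, 0, 1, 1])
    print(raw, adjusted_count(raw, tpr=0.9, fpr=0.1))
```

The adjusted estimate can differ noticeably from the raw count when the classifier is biased toward one class, which is one reason quantification is treated as a task distinct from classification.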

    Designing Deep Learning Frameworks for Plant Biology

    In recent years the parallel progress in high-throughput microscopy and deep learning drastically widened the landscape of possible research avenues in life sciences. In particular, combining high-resolution microscopic images and automated imaging pipelines powered by deep learning dramatically reduced the manual annotation work required for quantitative analysis. In this work, we will present two deep learning frameworks tailored to the needs of life scientists in the context of plant biology. First, we will introduce PlantSeg, a software tool for 2D and 3D instance segmentation. The PlantSeg pipeline contains several pre-trained models for different microscopy modalities and multiple popular graph-based instance segmentation algorithms. In the second part, we will present CellTypeGraph, a benchmark for quantitatively evaluating graph neural networks. The benchmark is designed to test the ability of machine learning methods to classify the types of cells in Arabidopsis thaliana ovules. CellTypeGraph's prime aim is to give a valuable tool to the geometric learning community, but at the same time it also offers a framework for plant biologists to perform fast and accurate cell type inference on new data.

    Metabolomics Data Processing and Data Analysis—Current Best Practices

    Metabolomics data analysis strategies are central to transforming raw metabolomics data files into meaningful biochemical interpretations that answer biological questions or generate novel hypotheses. This book contains a variety of papers from a Special Issue around the theme “Best Practices in Metabolomics Data Analysis”. Reviews and strategies for the whole metabolomics pipeline are included, whereas key areas such as metabolite annotation and identification, compound and spectral databases and repositories, and statistical analysis are highlighted in various papers. Altogether, this book contains valuable information for researchers just starting in their metabolomics career as well as those that are more experienced and look for additional knowledge and best practice to complement key parts of their metabolomics workflows

    Automatic face recognition using stereo images

    Face recognition is an important pattern recognition problem, in the study of both natural and artificial learning problems. Compared to other biometrics, it is non-intrusive, non-invasive and requires no participation from the subjects. As a result, it has many applications varying from human-computer-interaction to access control and law-enforcement to crowd surveillance. In typical optical image based face recognition systems, the systematic variability arising from representing the three-dimensional (3D) shape of a face by a two-dimensional (2D) illumination intensity matrix is treated as random variability. Multiple examples of the face displaying varying pose and expressions are captured in different imaging conditions. The imaging environment, pose and expressions are strictly controlled and the images undergo rigorous normalisation and pre-processing. This may be implemented in a partially or a fully automated system. Although these systems report high classification accuracies (>90%), they lack versatility and tend to fail when deployed outside laboratory conditions. Recently, more sophisticated 3D face recognition systems harnessing the depth information have emerged. These systems usually employ specialist equipment such as laser scanners and structured light projectors. Although more accurate than 2D optical image based recognition, these systems are equally difficult to implement in a non-co-operative environment. Existing face recognition systems, both 2D and 3D, detract from the main advantages of face recognition and fail to fully exploit its non-intrusive capacity. This is either because they rely too much on subject co-operation, which is not always available, or because they cannot cope with noisy data. The main objective of this work was to investigate the role of depth information in face recognition in a noisy environment.
A stereo-based system, inspired by the human binocular vision, was devised using a pair of manually calibrated digital off-the-shelf cameras in a stereo setup to compute depth information. Depth values extracted from 2D intensity images using stereoscopy are extremely noisy, and as a result this approach for face recognition is rare. This was confirmed by the results of our experimental work. Noise in the set of correspondences, camera calibration and triangulation led to inaccurate depth reconstruction, which in turn led to poor classifier accuracy for both 3D surface matching and 2D depth maps. Recognition experiments are performed on the Sheffield Dataset, consisting of 692 images of 22 individuals with varying pose, illumination and expressions.

    Machine learned boundary definitions for an expert's tracing assistant in image processing

    Department Head: Anton Willem Bohm. Includes bibliographical references (pages 178-184). Most image processing work addressing boundary definition tasks embeds the assumption that an edge in an image corresponds to the boundary of interest in the world. In straightforward imagery this is true; however, it is not always the case. There are images in which edges are indistinct or obscure, and these images can only be segmented by a human expert. The work in this dissertation addresses the range of imagery between the two extremes of those straightforward images and those requiring human guidance to appropriately segment. By freeing systems of a priori edge definitions and building in a mechanism to learn the boundary definitions needed, systems can do better and be more broadly applicable. This dissertation presents the construction of such a boundary-learning system and demonstrates the validity of this premise on real data. A framework was created for the task in which expert-provided boundary exemplars are used to create training data, which in turn are used by a neural network to learn the task and replicate the expert's boundary tracing behavior. This is the framework for the Expert's Tracing Assistant (ETA) system. For a representative set of nine structures in the Visible Human imagery, ETA was compared and contrasted to two state-of-the-art, user guided methods--Intelligent Scissors (IS) and Active Contour Models (ACM). Each method was used to define a boundary, and the distances between these boundaries and an expert's ground truth were compared. Across independent trials, there will be a natural variation in an expert's boundary tracing, and this degree of variation served as a benchmark against which these three methods were compared. For simple structural boundaries, all the methods were equivalent. However, in more difficult cases, ETA was shown to significantly better replicate the expert's boundary than either IS or ACM.
In these cases, where the expert's judgement was most called into play to bound the structure, ACM and IS could not adapt to the boundary character used by the expert while ETA could.

    Facilitating and Enhancing the Performance of Model Selection for Energy Time Series Forecasting in Cluster Computing Environments

    Applying Machine Learning (ML) manually to a given problem setting is a tedious and time-consuming process which brings many challenges with it, especially in the context of Big Data. In such a context, gaining insightful information, finding patterns, and extracting knowledge from large datasets are quite complex tasks. Additionally, the configurations of the underlying Big Data infrastructure introduce more complexity for configuring and running ML tasks. With the growing interest in ML over the last few years, people without extensive ML expertise in particular have a high demand for frameworks that assist them in applying the right ML algorithm to their problem setting. This is especially true in the field of smart energy system applications, where more and more ML algorithms are used, e.g. for time series forecasting. Generally, two groups of non-expert users who perform energy time series forecasting are distinguished. The first one includes the users who are familiar with statistics and ML but are not able to write the necessary programming code for training and evaluating ML models using the well-known trial-and-error approach. Such an approach is time-consuming and wastes resources on constructing multiple models. The second group is even more inexperienced in programming and not knowledgeable in statistics and ML but wants to apply given ML solutions to their problem settings. The goal of this thesis is to scientifically explore, in the context of more concrete use cases in the energy domain, how such non-expert users can be optimally supported in creating and performing ML tasks in practice on cluster computing environments. To support the first group of non-expert users, an easy-to-use modular extendable microservice-based ML solution for instrumenting and evaluating ML algorithms on top of a Big Data technology stack is conceptualized and evaluated.
Our proposed solution facilitates applying the trial-and-error approach by hiding the low-level complexities from the users and introduces the best conditions to efficiently perform ML tasks in cluster computing environments. To support the second group of non-expert users, the first solution is extended to realize meta learning approaches for automated model selection. We evaluate how meta learning technology can be efficiently applied to the problem space of data analytics for smart energy systems to assist energy system experts who are not data analytics experts in applying the right ML algorithms to their data analytics problems. To enhance the predictive performance of meta learning, an efficient characterization of energy time series datasets is required. To this end, Descriptive Statistics Time based Meta Features (DSTMF), a new kind of meta features, is designed to accurately capture the deep characteristics of energy time series datasets. We find that DSTMF outperforms the other state-of-the-art meta feature sets introduced in the literature to characterize energy time series datasets in terms of the accuracy of meta learning models and the time needed to extract them. Further enhancement in the predictive performance of the meta learning classification model is achieved by training the meta learner on new efficient meta examples. To this end, we propose two new approaches to generate new energy time series datasets to be used as training meta examples by the meta learner depending on the type of time series dataset (i.e. generation or energy consumption time series). We find that extending the original training sets with new meta examples generated by our approaches outperforms the case in which the original is extended by new simulated energy time series datasets.

    Domain Generalization in Computational Pathology: Survey and Guidelines

    Deep learning models have exhibited exceptional effectiveness in Computational Pathology (CPath) by tackling intricate tasks across an array of histology image analysis applications. Nevertheless, the presence of out-of-distribution data (stemming from a multitude of sources such as disparate imaging devices and diverse tissue preparation methods) can cause domain shift (DS). DS decreases the generalization of trained models to unseen datasets with slightly different data distributions, prompting the need for innovative domain generalization (DG) solutions. Recognizing the potential of DG methods to significantly influence diagnostic and prognostic models in cancer studies and clinical practice, we present this survey along with guidelines on achieving DG in CPath. We rigorously define various DS types, systematically review and categorize existing DG approaches and resources in CPath, and provide insights into their advantages, limitations, and applicability. We also conduct thorough benchmarking experiments with 28 cutting-edge DG algorithms to address a complex DG problem. Our findings suggest that careful experiment design and the CPath-specific Stain Augmentation technique can be very effective. However, there is no one-size-fits-all solution for DG in CPath. Therefore, we establish clear guidelines for detecting and managing DS depending on different scenarios. While most of the concepts, guidelines, and recommendations are given for applications in CPath, we believe that they are applicable to most medical image analysis tasks as well. Comment: Extended Versio

    A Framework for Meta-heuristic Parameter Performance Prediction Using Fitness Landscape Analysis and Machine Learning

    The behaviour of an optimization algorithm when attempting to solve a problem depends on the values assigned to its control parameters. For an algorithm to obtain desirable performance, its control parameter values must be chosen based on the current problem. Despite being necessary for optimal performance, selecting appropriate control parameter values is time-consuming, computationally expensive, and challenging. As the quantity of control parameters increases, so does the time complexity associated with searching for practical values, which often overshadows addressing the problem at hand, limiting the efficiency of an algorithm. As primarily recognized by the no free lunch theorem, there is no one-size-fits-all approach to problem-solving; hence, a tailored approach built on an understanding of the problem can substantially help solve it. To predict the performance of control parameter configurations in unseen environments, this thesis crafts an intelligent generalizable framework leveraging machine learning classification and quantitative characteristics about the problem in question. The proposed parameter performance classifier (PPC) framework is extensively explored by training 84 high-accuracy classifiers comprised of multiple sampling methods, fitness types, and binning strategies. Furthermore, the novel framework is utilized in constructing a new parameter-free particle swarm optimization (PSO) variant called PPC-PSO that effectively eliminates the computational cost of parameter tuning, yields competitive performance amongst other leading methodologies across 99 benchmark functions, and is highly accessible to researchers and practitioners. The success of PPC-PSO shows excellent promise for the applicability of the PPC framework in making many more robust parameter-free meta-heuristic algorithms in the future with incredible generalization capabilities.