
    Consistency-based Semi-supervised Active Learning: Towards Minimizing Labeling Cost

    Active learning (AL) combines data labeling and model training to minimize the labeling cost by prioritizing the selection of high-value data that can best improve model performance. In pool-based active learning, accessible unlabeled data are not used for model training in most conventional methods. Here, we propose to unify unlabeled sample selection and model training towards minimizing labeling cost, and make two contributions towards that end. First, we exploit both labeled and unlabeled data using semi-supervised learning (SSL) to distill information from unlabeled data during the training stage. Second, we propose a consistency-based sample selection metric that is coherent with the training objective, so that the selected samples are effective at improving model performance. We conduct extensive experiments on image classification tasks. The experimental results on CIFAR-10, CIFAR-100 and ImageNet demonstrate the superior performance of our proposed method with limited labeled data, compared to existing methods and to alternative AL and SSL combinations. Additionally, we study an important yet under-explored problem: "When can we start learning-based AL selection?" We propose a measure that is empirically correlated with the AL target loss and is potentially useful for determining the proper starting point of learning-based AL methods. Comment: Accepted by ECCV 2020.
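The selection idea can be illustrated with a minimal sketch: score each unlabeled sample by how much the model's predictions disagree across random augmentations, and label the least consistent ones first. The function names and the variance-based score below are illustrative assumptions; the paper's actual metric and augmentation scheme may differ.

```python
import numpy as np

def consistency_scores(probs):
    """probs: array (n_augments, n_samples, n_classes) of softmax
    predictions for each unlabeled sample under several random
    augmentations. Higher variance across augmentations means the
    model is less consistent on that sample, hence it is more
    informative to label."""
    # variance over the augmentation axis, summed over classes,
    # yielding one score per sample
    return probs.var(axis=0).sum(axis=1)

def select_for_labeling(probs, budget):
    """Return indices of the `budget` least-consistent samples."""
    scores = consistency_scores(probs)
    return np.argsort(scores)[::-1][:budget]
```

In an AL loop these indices would be sent to an annotator, and the model retrained with SSL on the enlarged labeled set plus the remaining pool.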

    Real-time numerical forecast of global epidemic spreading: Case study of 2009 A/H1N1pdm

    Background: Mathematical and computational models for infectious diseases are increasingly used to support public-health decisions; however, their reliability is currently under debate. Real-time forecasts of epidemic spread using data-driven models have been hindered by the technical challenges posed by parameter estimation and validation. Data gathered for the 2009 H1N1 influenza crisis represent an unprecedented opportunity to validate real-time model predictions and define the main success criteria for different approaches.
    Methods: We used the Global Epidemic and Mobility Model to generate stochastic simulations of epidemic spread worldwide, yielding (among other measures) the incidence and seeding events at a daily resolution for 3,362 subpopulations in 220 countries. Using a Monte Carlo Maximum Likelihood analysis, the model provided an estimate of the seasonal transmission potential during the early phase of the H1N1 pandemic and generated ensemble forecasts for the activity peaks in the northern hemisphere in the fall/winter wave. These results were validated against the real-life surveillance data collected in 48 countries, and their robustness was assessed by focusing on 1) the peak timing of the pandemic; 2) the level of spatial resolution allowed by the model; and 3) the clinical attack rate and the effectiveness of the vaccine. In addition, we studied the effect of data incompleteness on prediction reliability.
    Results: Real-time predictions of the peak timing are found to be in good agreement with the empirical data, showing strong robustness to data that may not be accessible in real time (such as pre-exposure immunity and adherence to vaccination campaigns), but that affect the predictions for the attack rates. The timing and spatial unfolding of the pandemic are critically sensitive to the level of mobility data integrated into the model.
    Conclusions: Our results show that large-scale models can be used to provide valuable real-time forecasts of influenza spreading, but they require high-performance computing. The quality of the forecast depends on the level of data integration, thus stressing the need for high-quality data in population-based models, and for progressive updates of validated empirical knowledge to inform these models.
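The full GLEaM model couples thousands of subpopulations through mobility data; as a much-reduced illustration of the ensemble-forecast idea, the sketch below runs a chain-binomial stochastic SIR model for a single toy subpopulation many times and collects the distribution of peak days. All parameter values here are illustrative assumptions, not the paper's estimates.

```python
import numpy as np

def stochastic_sir(n, i0, beta, gamma, days, rng):
    """Minimal single-population stochastic SIR (chain-binomial),
    a toy stand-in for one subpopulation of a global model such as
    GLEaM. Returns the list of daily new-infection counts."""
    s, i = n - i0, i0
    incidence = []
    for _ in range(days):
        p_inf = 1.0 - np.exp(-beta * i / n)        # per-susceptible daily infection prob.
        new_inf = rng.binomial(s, p_inf)           # stochastic new infections
        new_rec = rng.binomial(i, 1.0 - np.exp(-gamma))  # stochastic recoveries
        s -= new_inf
        i += new_inf - new_rec
        incidence.append(new_inf)
    return incidence

# Ensemble of stochastic runs, as in Monte Carlo peak-timing forecasts:
rng = np.random.default_rng(0)
runs = [stochastic_sir(100_000, 10, 0.3, 0.1, 120, rng) for _ in range(50)]
peak_days = [int(np.argmax(r)) for r in runs]
```

The spread of `peak_days` across the ensemble is the kind of quantity the paper validates against surveillance data.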

    Clustering daily patterns of human activities in the city

    Data mining and statistical learning techniques are powerful analysis tools yet to be incorporated in the domain of urban studies and transportation research. In this work, we analyze an activity-based travel survey conducted in the Chicago metropolitan area on a demographically representative sample of its population. Detailed data on activities by time of day were collected from more than 30,000 individuals (and 10,552 households) who participated in a 1-day or 2-day survey implemented from January 2007 to February 2008. We examine these large-scale data in order to explore three critical issues: (1) the inherent daily activity structure of individuals in a metropolitan area, (2) the variation of individual daily activities—how they grow and fade over time, and (3) clusters of individual behaviors and the revelation of their related socio-demographic information. We find that the population can be clustered into 8 and 7 representative groups according to their activities during weekdays and weekends, respectively. Our results enrich the traditional division into only three groups (workers, students and non-workers) and provide clusters based on activities at different times of day. The generated clusters, combined with socio-demographic information, provide a new perspective for urban and transportation planning as well as for emergency response and spreading dynamics, by addressing when, where, and how individuals interact with places in metropolitan areas.
    Massachusetts Institute of Technology, Dept. of Urban Studies and Planning; United States Dept. of Transportation (Region One University Transportation Center); Singapore-MIT Alliance for Research and Technology
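The clustering step can be sketched as follows: represent each person as a 24-dimensional daily activity profile (here, the probability of being "at work" in each hour) and cluster the profiles. The toy data, the two-cluster setting, and the tiny k-means implementation below are illustrative assumptions; the paper finds 8 weekday and 7 weekend groups from the real survey.

```python
import numpy as np

def kmeans(X, k=2, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization,
    enough to cluster toy activity profiles."""
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(1) for c in centers], axis=0)
        centers.append(X[d.argmax()])          # pick the farthest point as next seed
    centers = np.array(centers)
    for _ in range(iters):
        d = ((X[:, None] - centers[None]) ** 2).sum(-1)
        labels = d.argmin(1)                   # assign each profile to nearest center
        centers = np.array([X[labels == j].mean(0) for j in range(k)])
    return labels

hours = np.arange(24)
work_mask = ((hours >= 9) & (hours < 17)).astype(float)

rng = np.random.default_rng(1)
# 50 "worker-like" profiles: high at-work probability from 9:00 to 17:00
workers = np.clip(rng.normal(0.8, 0.1, (50, 24)) * work_mask, 0.0, 1.0)
# 50 "non-worker" profiles: uniformly low at-work probability
others = np.clip(rng.normal(0.05, 0.03, (50, 24)), 0.0, 1.0)
profiles = np.vstack([workers, others])

labels = kmeans(profiles, k=2)
```

Each recovered cluster center is itself an interpretable "representative daily pattern", which is what makes this analysis useful for planners.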

    Chain-detection Between Clusters


    Interactive deep metric learning for healthcare cohort discovery

    © Springer Nature Singapore Pte Ltd. 2019. Given the continuous growth of large-scale, complex electronic healthcare data, data-driven healthcare cohort discovery, facilitated by machine learning tools with domain expert knowledge, is required to gain further insight into the healthcare system. Specifically, clustering plays a crucial role in healthcare cohort discovery, and metric learning is able to incorporate expert feedback to generate more fit-for-purpose clustering outputs. However, most existing metric learning methods assume that all labelled instances already exist, which is not always true in real-world applications. In addition, big data in healthcare also bring new challenges to metric learning in handling complex structured data. In this paper, we propose a novel systematic method, namely Interactive Deep Metric Learning (IDML), which uses an interactive process to iteratively incorporate feedback from domain experts to identify cohorts that are more relevant to a particular pre-defined purpose. Moreover, the proposed method leverages powerful deep learning-based embedding techniques to incrementally obtain effective representations for the complex structures inherent in patient journey data. We experimentally evaluate the effectiveness of the proposed IDML using two public healthcare datasets. The proposed method has also been implemented in an interactive cohort discovery tool for a real-world application in healthcare.
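The interactive loop can be illustrated with a deliberately simplified stand-in for the deep embedding update: each round, expert feedback arrives as "same cohort" (must-link) and "different cohort" (cannot-link) pairs, and a per-feature weighting of the distance metric is nudged accordingly. The weighted-distance form and the update rule below are assumptions for illustration, not IDML's actual architecture.

```python
import numpy as np

def update_weights(w, X, must_link, cannot_link, lr=0.1):
    """One feedback round: adjust per-feature weights w so that
    expert-confirmed 'same cohort' pairs get closer and 'different
    cohort' pairs get farther under the weighted distance
    d(a, b) = sum_f w_f * (a_f - b_f)^2."""
    for i, j in must_link:
        w = w - lr * (X[i] - X[j]) ** 2      # shrink these pairs' distance
    for i, j in cannot_link:
        w = w + lr * (X[i] - X[j]) ** 2      # stretch these pairs' distance
    return np.clip(w, 1e-3, None)            # keep all weights positive (valid metric)

def wdist(w, a, b):
    """Weighted squared Euclidean distance."""
    return float(((a - b) ** 2 * w).sum())
```

After each update, re-clustering patients under the new metric and showing the result to the expert closes the interactive loop.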

    A Distribution Dependent and Independent Complexity Analysis of Manifold Regularization

    Manifold regularization is a commonly used technique in semi-supervised learning. It enforces the classification rule to be smooth with respect to the data manifold. Here, we derive sample complexity bounds based on pseudo-dimension for models that add a convex, data-dependent regularization term to a supervised learning process, as is done in particular in manifold regularization. We then compare the bounds for those semi-supervised methods to those for purely supervised methods, and discuss a setting in which the semi-supervised method can achieve only a constant improvement, ignoring logarithmic terms. By viewing manifold regularization as a kernel method, we then derive Rademacher bounds which allow for a distribution-dependent analysis. Finally, we illustrate that these bounds may be useful for choosing an appropriate manifold regularization parameter in situations with very sparsely labeled data.
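A concrete instance of the smoothness penalty is Laplacian regularization on a similarity graph: fit labeled points while penalizing f^T L f, which is large when f varies across strongly connected (nearby) points. The sketch below uses the graph form with a closed-form solve; the function name and the hard-coded example are illustrative.

```python
import numpy as np

def laplacian_regularized_fit(W, y, labeled, gamma=1.0):
    """Graph form of manifold regularization: find scores f minimizing
        sum_{i in labeled} (f_i - y_i)^2  +  gamma * f^T L f,
    where L = D - W is the graph Laplacian of similarity matrix W.
    Setting the gradient to zero gives (J + gamma*L) f = J y,
    with J = diag(labeled indicator)."""
    L = np.diag(W.sum(1)) - W
    J = np.diag(np.isin(np.arange(len(W)), labeled).astype(float))
    return np.linalg.solve(J + gamma * L, J @ y)
```

On a path graph labeled +1 at one end and -1 at the other, the unlabeled interior nodes receive smoothly interpolated scores, which is exactly the "smooth with respect to the data manifold" behavior the abstract describes.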

    Personality prediction for microblog users with active learning method

    Personality research on social media is a hot topic recently, due to the rapid development of social media as well as the central importance of personality in psychology, but it is hard to acquire adequate, appropriately labeled samples. Given a set of Microblog users' public information (e.g., number of followers) and a few labeled users, the task is to predict the personality of the other, unlabeled users. Our research aims to choose the right users to be labeled so as to improve prediction accuracy. An active learning regression algorithm is employed to establish the prediction model in this paper, and the experimental results demonstrate that our method can predict the personality of Microblog users fairly well. © Springer International Publishing Switzerland 2015.
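One common way to "choose the right users to label" in a regression setting is query-by-committee: fit a small committee of models on bootstrap resamples of the labeled users and query the pool user whose prediction they disagree on most. This heuristic and the helper below are illustrative assumptions; the paper's exact selection rule may differ.

```python
import numpy as np

def query_by_committee(X_lab, y_lab, X_pool, n_models=10, seed=0):
    """Return the index of the pool point with the highest prediction
    variance across a committee of linear least-squares models, each
    fit on a bootstrap resample of the labeled set."""
    rng = np.random.default_rng(seed)
    Xl = np.c_[np.ones(len(X_lab)), X_lab]      # add intercept column
    Xp = np.c_[np.ones(len(X_pool)), X_pool]
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(Xl), len(Xl), replace=True)
        w, *_ = np.linalg.lstsq(Xl[idx], y_lab[idx], rcond=None)
        preds.append(Xp @ w)
    return int(np.var(preds, axis=0).argmax())
```

Because disagreement grows where the labeled data give little support, the selected user is the one whose personality label would most reduce model uncertainty.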

    Overlapping Hierarchical Clustering (OHC)

    Agglomerative clustering methods have been widely used by many research communities to cluster their data into hierarchical structures. These structures ease data exploration and are understandable even for non-specialists. But these methods necessarily result in a tree, since, at each agglomeration step, two clusters have to be merged. This may bias the data analysis process if, for example, a cluster is almost equally attracted by two others. In this paper we propose a new method that allows clusters to overlap until a strong cluster attraction is reached, based on a density criterion. The resulting hierarchical structure, called a quasi-dendrogram, is represented as a directed acyclic graph and combines the advantages of hierarchies with the precision of a less arbitrary clustering. We validate our work with extensive experiments on real data sets and compare it with existing tree-based methods, using a new measure of similarity between heterogeneous hierarchical structures.
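The key departure from standard agglomerative clustering can be sketched in one step: instead of forcing a single binary merge, merge every pair of clusters whose linkage lies within a slack of the minimum, so a cluster near-equally attracted to two others joins both and the merge history forms a DAG. Note this distance-slack rule is a simplification I am substituting for the paper's density criterion, purely for illustration.

```python
import numpy as np
from itertools import combinations

def merge_step(clusters, X, slack=0.1):
    """One agglomeration step allowing overlaps: all cluster pairs
    whose average linkage is within (1 + slack) of the minimum are
    merged, so a cluster may participate in several merges."""
    def linkage(a, b):
        # average-linkage distance between two clusters of point indices
        return np.mean([np.linalg.norm(X[i] - X[j]) for i in a for j in b])

    pairs = list(combinations(range(len(clusters)), 2))
    dists = [linkage(clusters[a], clusters[b]) for a, b in pairs]
    dmin = min(dists)
    merged, used = [], set()
    for (a, b), d in zip(pairs, dists):
        if d <= (1 + slack) * dmin:
            merged.append(sorted(set(clusters[a]) | set(clusters[b])))
            used.update((a, b))
    # clusters not involved in any near-minimal merge carry over unchanged
    merged += [clusters[k] for k in range(len(clusters)) if k not in used]
    return merged
```

With three equally spaced points, the middle singleton joins both neighbors, yielding two overlapping clusters, the situation a strict dendrogram cannot represent.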