117 research outputs found

    Scraping social media photos posted in Kenya and elsewhere to detect and analyze food types

    Monitoring population-level changes in diet could be useful for education and for implementing interventions to improve health. Research has shown that data from social media sources can be used for monitoring dietary behavior. We propose a scrape-by-location methodology to create food image datasets from Instagram posts. We used it to collect 3.56 million images over a period of 20 days in March 2019. We also propose a scrape-by-keywords methodology and used it to scrape ∼30,000 images and their captions of 38 Kenyan food types. We publish two datasets of 104,000 and 8,174 image/caption pairs, respectively. With the first dataset, Kenya104K, we train a Kenyan Food Classifier, called KenyanFC, to distinguish Kenyan food from non-food images posted in Kenya. We used the second dataset, KenyanFood13, to train a classifier KenyanFTR, short for Kenyan Food Type Recognizer, to recognize 13 popular food types in Kenya. The KenyanFTR is a multimodal deep neural network that can identify 13 types of Kenyan foods using both images and their corresponding captions. Experiments show that the average top-1 accuracy of KenyanFC is 99% over 10,400 tested Instagram images and that of KenyanFTR is 81% over 8,174 tested data points. Ablation studies show that three of the 13 food types are particularly difficult to categorize based on image content only and that adding analysis of captions to the image analysis yields a classifier that is 9 percentage points more accurate than a classifier that relies only on images. Our food trend analysis revealed that cakes and roasted meats were the most popular foods in photographs on Instagram in Kenya in March 2019.

    Combining Search, Social Media, and Traditional Data Sources to Improve Influenza Surveillance

    We present a machine learning-based methodology capable of providing real-time ("nowcast") and forecast estimates of influenza activity in the US by leveraging data from multiple data sources including: Google searches, Twitter microblogs, nearly real-time hospital visit records, and data from a participatory surveillance system. Our main contribution consists of combining multiple influenza-like illness (ILI) activity estimates, generated independently with each data source, into a single prediction of ILI utilizing machine learning ensemble approaches. Our methodology exploits the information in each data source and produces accurate weekly ILI predictions for up to four weeks ahead of the release of CDC's ILI reports. We evaluate the predictive ability of our ensemble approach during the 2013-2014 (retrospective) and 2014-2015 (live) flu seasons for each of the four weekly time horizons. Our ensemble approach demonstrates several advantages: (1) our ensemble method's predictions outperform every prediction using each data source independently, (2) our methodology can produce predictions one week ahead of GFT's real-time estimates with comparable accuracy, and (3) our two and three week forecast estimates have comparable accuracy to real-time predictions using an autoregressive model. Moreover, our results show that considerable insight is gained from incorporating disparate data streams, in the form of social media and crowd sourced data, into influenza predictions in all time horizons.
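The abstract does not say which ensemble method combines the per-source ILI estimates. As a rough illustration only, the sketch below uses linear stacking (ordinary least squares over the individual estimates) as a stand-in. The function names, the four synthetic "sources", and the noise levels are all assumptions made for this example, not the authors' data or method.

```python
import numpy as np

def fit_stacking_weights(source_estimates, observed_ili):
    """Least-squares weights (intercept first) mapping per-source estimates to observed ILI."""
    design = np.column_stack([np.ones(len(observed_ili)), source_estimates])
    coef, *_ = np.linalg.lstsq(design, observed_ili, rcond=None)
    return coef

def combine(source_estimates, coef):
    """Apply fitted stacking weights to per-source estimates."""
    design = np.column_stack([np.ones(source_estimates.shape[0]), source_estimates])
    return design @ coef

def rmse(predicted, observed):
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))

# Synthetic weekly ILI signal and four noisy per-source estimates (hypothetical data).
rng = np.random.default_rng(0)
weeks = np.arange(52)
true_ili = 2.0 + 1.5 * np.sin(2 * np.pi * weeks / 52)
noise_levels = [0.3, 0.5, 0.4, 0.6]  # one assumed noise level per source
sources = np.column_stack(
    [true_ili + rng.normal(0.0, s, size=weeks.size) for s in noise_levels]
)

coef = fit_stacking_weights(sources, true_ili)
ensemble = combine(sources, coef)

source_rmses = [rmse(sources[:, j], true_ili) for j in range(sources.shape[1])]
ensemble_rmse = rmse(ensemble, true_ili)
print("per-source RMSE:", [round(r, 3) for r in source_rmses])
print("ensemble RMSE:", round(ensemble_rmse, 3))
```

On the fitting data the stacked estimate cannot do worse than any single source, because using one source alone is a special case of the linear combination. Out-of-sample performance, which is what the paper evaluates, needs held-out weeks.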

    Modeling to Predict Cases of Hantavirus Pulmonary Syndrome in Chile

    Background: Hantavirus pulmonary syndrome (HPS) is a life-threatening disease transmitted by the rodent Oligoryzomys longicaudatus in Chile. Hantavirus outbreaks are typically small and geographically confined. Several studies have estimated risk based on spatial and temporal distribution of cases in relation to climate and environmental variables, but few have considered climatological modeling of HPS incidence for monitoring and forecasting purposes. Methodology: Monthly counts of confirmed HPS cases were obtained from the Chilean Ministry of Health for 2001–2012. There were an estimated 667 confirmed HPS cases. The data suggested a seasonal trend, which appeared to correlate with changes in climatological variables such as temperature, precipitation, and humidity. We considered several Auto Regressive Integrated Moving Average (ARIMA) time-series models and regression models with ARIMA errors with one or a combination of these climate variables as covariates. We adopted an information-theoretic approach to model ranking and selection. Data from 2001–2009 were used in fitting and data from January 2010 to December 2012 were used for one-step-ahead predictions. Results: We focused on six models. In a baseline model, future HPS cases were forecasted from previous incidence; the other models included climate variables as covariates. The baseline model had a Corrected Akaike Information Criterion (AICc) of 444.98, and the top ranked model, which included precipitation, had an AICc of 437.62. Although the AICc of the top ranked model only provided a 1.65% improvement to the baseline AICc, the empirical support was 39 times stronger relative to the baseline model. Conclusions: Instead of choosing a single model, we present a set of candidate models that can be used in modeling and forecasting confirmed HPS cases in Chile. The models can be improved by using data at the regional level and easily extended to other countries with seasonal incidence of HPS.
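The abstract does not show how the "1.65%" and "39 times" figures follow from the two AICc values. A minimal sketch, assuming the usual information-theoretic evidence ratio exp(Δ/2), where Δ is the AICc difference between the two models, reproduces both numbers. The function names are made up for this example.

```python
import math

def percent_improvement(aicc_best, aicc_other):
    """Relative reduction in AICc of the best model versus another, in percent."""
    return 100.0 * (aicc_other - aicc_best) / aicc_other

def evidence_ratio(aicc_best, aicc_other):
    """Evidence ratio exp(delta / 2), assuming the standard Akaike-weight formulation."""
    delta = aicc_other - aicc_best
    return math.exp(delta / 2.0)

# AICc values reported in the abstract.
baseline_aicc = 444.98
top_aicc = 437.62

improvement = percent_improvement(top_aicc, baseline_aicc)
ratio = evidence_ratio(top_aicc, baseline_aicc)
print(f"AICc difference: {baseline_aicc - top_aicc:.2f}")
print(f"Relative improvement: {improvement:.2f}%")
print(f"Evidence ratio: {ratio:.1f}")
```

Under this assumption the AICc difference of 7.36 gives an evidence ratio between 39 and 40, consistent with the reported "39 times stronger" support. A change of under 2% in raw AICc can therefore still correspond to a large difference in relative support.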