43 research outputs found
Advancing The Scope of Nontraditional Data Streams: Using Internet Data To Understand Human Behavior, Account For Bias, and Improve Forecasting
Internet data are pervasive. An almost incomprehensible amount of data¹ is generated daily as a result of search engines, email, social media, and other web platforms [1]. Many of these data are publicly available and can provide direct insight into individuals' thoughts and behaviors. In particular, many fields, including disease surveillance and political science, have found these data to be useful for predictive modeling.
Traditionally, predictive models have relied on data from official or academic sources including surveys, outbreak case studies, and official disease reports. While these data are often considered accurate, they are time-consuming and inefficient to collect. In situations that require real-time decision making, stale official data can produce stale models. Internet data streams can complement official reports to produce better, more comprehensive models.
This thesis contributes to multiple domains' understanding of when and how to use nontraditional data sources, the kinds of data available in online spaces, and appropriate methods to incorporate these data into downstream applications. Influenza, Zika, and the political peace process in Colombia from 2011 to 2017 are used as case studies. Drawing case studies from multiple domains provides a broader understanding of Internet data.
In particular, this thesis will:
(1) Use Internet data streams to identify human behaviors
(2) Develop methods to understand bias in Internet data streams and classification algorithms
(3) Evaluate usefulness of Internet data in various forecasting models
To address these aims, I use large corpora of Internet data, including social media data. Methods are drawn from the natural language processing, social computing, data mining, computational epidemiology, and machine learning literature.
Accurate models help decision makers make timely and effective decisions about interventions. Currently, models are a common first-line consideration for decision makers at all levels of government, globally. A better understanding of the traces available in nontraditional data streams, and a more nuanced understanding of modeling decisions, can immediately impact decision makers throughout our country and the globe².
¹ Estimated 2.5 exabytes daily in May 2018 [1].
² The Los Alamos Unlimited Release number for this document is LA-UR-19-31417.
Estimating influenza incidence using search query deceptiveness and generalized ridge regression
Seasonal influenza is a sometimes surprisingly impactful disease, causing
thousands of deaths per year along with much additional morbidity. Timely
knowledge of the outbreak state is valuable for managing an effective response.
The current state of the art is to gather this knowledge using in-person
patient contact. While accurate, this is time-consuming and expensive. This has
motivated inquiry into new approaches using internet activity traces, based on
the theory that lay observations of health status lead to informative features
in internet data.
These approaches risk being deceived by activity traces having a
coincidental, rather than informative, relationship to disease incidence; to
our knowledge, this risk has not yet been quantitatively explored. We evaluated
both simulated and real activity traces of varying deceptiveness for influenza
incidence estimation using linear regression.
We found that deceptiveness knowledge does reduce error in such estimates,
that it may help automatically-selected features perform as well or better than
features that require human curation, and that a semantic distance measure
derived from the Wikipedia article category tree serves as a useful proxy for
deceptiveness. This suggests that disease incidence estimation models should
incorporate not only data about how internet features map to incidence but also
additional data to estimate feature deceptiveness. By doing so, we may gain one
more step along the path to accurate, reliable disease incidence estimation
using internet data. This capability would improve public health by decreasing
the cost and increasing the timeliness of such estimates.
Comment: 27 pages, 8 figures
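The abstract above names generalized ridge regression and feature deceptiveness but does not spell out the formulation. As a rough sketch only, and not necessarily the authors' method, one common reading is ridge regression with a separate penalty per feature, so that features believed to be more deceptive are shrunk harder. The linear mapping from a deceptiveness score to a penalty (`scale * d`), the function names, and the toy data below are all assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def generalized_ridge(X, y, penalties):
    """Return beta = (X^T X + diag(penalties))^{-1} X^T y, one penalty per feature."""
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    for j in range(p):
        xtx[j][j] += penalties[j]
    return solve(xtx, xty)


# Toy data (invented): two activity-trace features and an incidence target.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 1.0, 2.0]

# Hypothetical deceptiveness scores in [0, 1]; feature 1 is assumed deceptive.
deceptiveness = [0.0, 1.0]
scale = 100.0  # assumed penalty strength, not taken from the paper
penalties = [scale * d for d in deceptiveness]

beta_plain = generalized_ridge(X, y, [0.0, 0.0])    # ordinary least squares: [1.0, 1.0]
beta_weighted = generalized_ridge(X, y, penalties)  # about [1.49, 0.015]
```

With no penalty both features receive equal weight; once the second feature is penalized for its assumed deceptiveness, its coefficient shrinks toward zero and the first feature absorbs most of the signal.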
Epidemiological data challenges: planning for a more robust future through data standards
Accessible epidemiological data are of great value for emergency preparedness
and response, understanding disease progression through a population, and
building statistical and mechanistic disease models that enable forecasting.
The status quo, however, renders acquiring and using such data difficult in
practice. In many cases, a primary way of obtaining epidemiological data is
through the internet, but the methods by which the data are presented to the
public often differ drastically among institutions. As a result, there is a
strong need for better data sharing practices. This paper identifies, in detail
and with examples, the three key challenges one encounters when attempting to
acquire and use epidemiological data: 1) interfaces, 2) data formatting, and 3)
reporting. These challenges are used to provide suggestions and guidance for
improvement as these systems evolve in the future. If these suggested data and
interface recommendations were adhered to, epidemiological and public health
analysis, modeling, and informatics work would be significantly streamlined,
which can in turn yield better public health decision-making capabilities.
Comment: v2 includes several typo fixes; v3 adds a paragraph on backfill; v4
adds 2 new paragraphs to the conclusion that address Frontiers reviewer
comments; v5 adds some minor modifications that address additional reviewer
comments
Salivary microbiomes of indigenous Tsimane mothers and infants are distinct despite frequent premastication
Background
Premastication, the transfer of pre-chewed food, is a common infant and young child feeding practice among the Tsimane, forager-horticulturalists living in the Bolivian Amazon. Research conducted primarily with Western populations has shown that infants harbor distinct oral microbiota from their mothers. Premastication, which is less common in these populations, may influence the colonization and maturation of infant oral microbiota, including via transmission of oral pathogens. We collected premasticated food and saliva samples from Tsimane mothers and infants (9–24 months of age) to test for evidence of bacterial transmission in premasticated foods and overlap in maternal and infant salivary microbiota. We extracted bacterial DNA from two premasticated food samples and 12 matched salivary samples from maternal-infant pairs. DNA sequencing was performed with MiSeq (Illumina). We evaluated maternal and infant microbial composition in terms of relative abundance of specific taxa, alpha and beta diversity, and dissimilarity distances.
Results
The bacteria in saliva and premasticated food were mapped to 19 phyla and 400 genera and were dominated by Firmicutes, Proteobacteria, Actinobacteria, and Bacteroidetes. The oral microbial communities of Tsimane mothers and infants who frequently share premasticated food were well-separated in a non-metric multidimensional scaling (NMDS) ordination plot. Infant microbiotas clustered together, with weighted UniFrac distances significantly differing between mothers and infants. Infant saliva contained more Firmicutes (p < 0.01) and fewer Proteobacteria (p < 0.05) than did maternal saliva. Many genera previously associated with dental and periodontal infections, e.g. Neisseria, Gemella, Rothia, Actinomyces, Fusobacterium, and Leptotrichia, were more abundant in mothers than in infants.
Conclusions
Salivary microbiota of Tsimane infants and young children up to two years of age do not appear closely related to those of their mothers, despite frequent premastication and preliminary evidence that maternal bacteria are transmitted to premasticated foods. Infant physiology and diet may constrain colonization by maternal bacteria, including several oral pathogens.
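For readers unfamiliar with the diversity terms in the abstract above, the following is a minimal sketch of one alpha-diversity measure (the Shannon index) and one between-sample dissimilarity. The study reports weighted UniFrac, which needs a phylogenetic tree; Bray-Curtis is swapped in here purely because it can be computed from counts alone. Whether the authors used the Shannon index is not stated, and the taxon counts are invented for illustration.

```python
import math


def shannon_index(counts):
    """Shannon alpha diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa (0 = identical)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(a) + sum(b)
    return numerator / denominator


# Invented read counts for four taxa in one maternal and one infant sample.
mother = [40, 30, 20, 10]
infant = [70, 10, 15, 5]

alpha_mother = shannon_index(mother)
alpha_infant = shannon_index(infant)
dissimilarity = bray_curtis(mother, infant)  # 60 / 200 = 0.3
```

In this toy case the infant sample is dominated by a single taxon, so its Shannon index is lower than the mother's, and the two samples differ by a Bray-Curtis dissimilarity of 0.3.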
The Biosurveillance Analytics Resource Directory (BARD): Facilitating the Use of Epidemiological Models for Infectious Disease Surveillance
Epidemiological modeling for infectious disease is important for disease management, and its routine implementation needs to be facilitated through better description of models in an operational context. A standardized model characterization process that allows selection or manual comparison of available models and their results is currently lacking. A key need is a universal framework to facilitate the description of a model and the understanding of its features. Los Alamos National Laboratory (LANL) has developed a comprehensive framework that can be used to characterize an infectious disease model in an operational context. The framework was developed through a consensus among a panel of subject matter experts. In this paper, we describe the framework, its application to model characterization, and the development of the Biosurveillance Analytics Resource Directory (BARD; http://brd.bsvgateway.org/brd/), to facilitate the rapid selection of operational models for specific infectious/communicable diseases. We offer this framework and associated database to stakeholders of the infectious disease modeling field as a tool for standardizing model description and facilitating the use of epidemiological models.
Evaluation of Point of Need Diagnostic Tests for Use in California Influenza Outbreaks
Because of the potential threats flu viruses pose, the United States, like many developed countries, has a well-established flu surveillance system consisting of 10 components collecting laboratory data, mortality data, hospitalization data, and sentinel outpatient care data. Currently, this surveillance system is estimated to lag behind the actual seasonal outbreak by one to two weeks. As new data streams come online, it is important to understand what added benefit they bring to the flu surveillance system complex. For data streams to be effective, they should provide data in a more timely fashion or provide additional data that current surveillance systems cannot provide. Two multiplexed diagnostic tools designed to test syndromically relevant pathogens and wirelessly upload data for rapid integration and interpretation were evaluated to see how they fit into the influenza surveillance scheme in California.
Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited.
The ability to produce timely and accurate flu forecasts in the United States can significantly impact public health. Augmenting forecasts with internet data has shown promise for improving forecast accuracy and timeliness in controlled settings, but results in practice are less convincing, as models augmented with internet data have not consistently outperformed models without internet data. In this paper, we perform a controlled experiment, taking into account data backfill, to improve clarity on the benefits and limitations of augmenting an already good flu forecasting model with internet-based nowcasts. Our results show that a good flu forecasting model can benefit from the augmentation of internet-based nowcasts in practice for all considered public health-relevant forecasting targets. The degree of forecast improvement due to nowcasting, however, is uneven across forecasting targets, with short-term forecasting targets seeing the largest improvements and seasonal targets such as the peak timing and intensity seeing relatively marginal improvements. The uneven forecasting improvements across targets hold even when "perfect" nowcasts are used. These findings suggest that further improvements to flu forecasting, particularly for seasonal targets, will need to derive from other, non-nowcasting approaches.