5 research outputs found
EDNA-COVID: A Large-Scale Covid-19 Dataset Collected with the EDNA Streaming Toolkit
The Covid-19 pandemic has fundamentally altered many facets of our lives.
With nationwide lockdowns and stay-at-home advisories, conversations about the
pandemic have naturally moved to social networks, e.g. Twitter. This affords an
unprecedented insight into the evolution of social discourse in the presence of
a long-running destabilizing factor such as a pandemic with the high-volume,
high-velocity, high-noise Covid-19 Twitter feed. However, real-time information
extraction from such a data stream requires a fault-tolerant streaming
infrastructure to perform the non-trivial integration of heterogeneous data
sources from news organizations, social feeds, and authoritative medical
organizations like the CDC. To address this, we present (i) the EDNA streaming
toolkit for consuming and processing streaming data, and (ii) EDNA-Covid, a
multilingual, large-scale dataset of coronavirus-related tweets collected with
EDNA since January 25, 2020. EDNA-Covid includes, at the time of this publication,
over 600M tweets from around the world in over 10 languages. We release both
the EDNA toolkit and the EDNA-Covid dataset to the public so that they can be
used to extract valuable insights on this extraordinary social event.
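The abstract describes EDNA as a toolkit for consuming and processing streaming data from heterogeneous sources. A minimal sketch of such an ingest/filter/sink pipeline is below; the function names and record shapes are illustrative assumptions, not EDNA's actual API.

```python
# Minimal sketch of a streaming ingest/process/sink pipeline in the
# spirit of the EDNA toolkit; names and record layout are illustrative.

def ingest(records):
    """Source stage: yield raw records from any iterable feed."""
    for rec in records:
        yield rec

def keyword_filter(stream, keywords):
    """Process stage: keep records mentioning any tracked keyword."""
    for rec in stream:
        text = rec.get("text", "").lower()
        if any(k in text for k in keywords):
            yield rec

def emit(stream):
    """Sink stage: collect processed records (stand-in for a datastore)."""
    return list(stream)

feed = [
    {"id": 1, "text": "New CDC guidance on coronavirus testing"},
    {"id": 2, "text": "Weekend sports roundup"},
    {"id": 3, "text": "Covid-19 case counts rise in three states"},
]
kept = emit(keyword_filter(ingest(feed), ["covid", "coronavirus"]))
```

Generator-based stages keep memory bounded, which matters for a high-volume, high-velocity feed like the Covid-19 Twitter stream.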
Concept Drift Detection and Adaptation with Weak Supervision on Streaming Unlabeled Data
Concept drift in learning and classification occurs when the statistical
properties of either the data features or target change over time; evidence of
drift has appeared in search data, medical research, malware, web data, and
video. Drift adaptation has not yet been addressed in high dimensional, noisy,
low-context data such as streaming text, video, or images due to the unique
challenges these domains present. We present a two-fold approach to deal with
concept drift in these domains: a density-based clustering approach to deal
with virtual concept drift (change in statistical properties of features) and a
weak-supervision step to deal with real concept drift (change in statistical
properties of target). Our density-based clustering avoids problems posed by
the curse of dimensionality to create an evolving 'map' of the live data space,
thereby addressing virtual drift in features. Our weak-supervision step
leverages high-confidence labels (oracle or heuristic labels) to generate
weighted training sets to generalize and update existing deep learners to adapt
to changing decision boundaries (real drift) and create new deep learners for
unseen regions of the data space. Our results show that our two-fold approach
performs well with >90% precision in 2018, four years after initial deployment
in 2014, without any human intervention.
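The two-fold approach above can be illustrated with a toy sketch: a density "map" of the live data space flags points in unseen regions (virtual drift), and high-confidence labels are kept with weights to update learners (real drift). The grid-cell density below is a crude stand-in for the paper's density-based clustering, and all thresholds are assumptions.

```python
# Toy sketch of the two-fold drift approach: (i) a density 'map' of the
# data space flags unseen regions (virtual drift); (ii) only
# high-confidence labels feed the weighted training set (real drift).
from collections import Counter

def build_density_map(points, cell=1.0):
    """Count points per grid cell as a crude evolving map of the space."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

def is_unseen(point, density_map, cell=1.0, min_count=2):
    """A point landing in a sparsely populated cell suggests virtual drift."""
    key = (int(point[0] // cell), int(point[1] // cell))
    return density_map[key] < min_count

def weighted_training_set(samples):
    """Weak supervision: keep (x, label, confidence) triples above a bar."""
    return [(x, y, conf) for x, y, conf in samples if conf >= 0.8]

history = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (5.0, 5.1)]
dmap = build_density_map(history)
drifted = is_unseen((5.2, 5.3), dmap)   # lone point in its cell
stable = is_unseen((0.25, 0.3), dmap)   # falls in a dense cell
train = weighted_training_set([("a", 1, 0.95), ("b", 0, 0.5)])
```

In the paper's setting the map evolves with the stream and new deep learners are spawned for unseen regions; here a fixed grid simply makes the region test concrete.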
Event Detection in Noisy Streaming Data with Combination of Corroborative and Probabilistic Sources
Global physical event detection has traditionally relied on dense coverage of
physical sensors around the world; while this is an expensive undertaking,
there have not been alternatives until recently. The ubiquity of social
networks and human sensors in the field provides a tremendous amount of
real-time, live data about true physical events from around the world. However,
while such human sensor data have been exploited for retrospective large-scale
event detection, such as hurricanes or earthquakes, there has been limited to no
success in exploiting this rich resource for general physical event detection.
Prior implementation approaches have suffered from the concept drift
phenomenon, where real-world data exhibits constant, unknown, unbounded changes
in its data distribution, making static machine learning models ineffective in
the long term. We propose and implement an end-to-end collaborative drift
adaptive system that integrates corroborative and probabilistic sources to
deliver real-time predictions. Furthermore, out system is adaptive to concept
drift and performs automated continuous learning to maintain high performance.
We demonstrate our approach in a real-time demo available online for landslide
disaster detection, with extensibility to other real-world physical events such
as flooding, wildfires, hurricanes, and earthquakes.
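The integration of corroborative and probabilistic sources can be sketched as a simple fusion rule: a classifier score over noisy social posts is boosted by each independent corroborating report (e.g. an agency bulletin or news item). The additive boost and threshold below are illustrative assumptions, not the system's actual fusion method.

```python
# Hedged sketch of fusing a probabilistic source (a classifier score over
# social posts) with corroborative sources (agency/news reports).
# The additive boost and threshold are illustrative choices.

def fuse(ml_score, corroborating_reports, boost=0.2, threshold=0.7):
    """Raise event confidence for each independent corroboration."""
    score = min(1.0, ml_score + boost * len(corroborating_reports))
    return score >= threshold, round(score, 2)

# A weak classifier score alone does not trigger a detection...
alone = fuse(0.55, [])
# ...but two corroborating reports push it over the threshold.
backed = fuse(0.55, ["usgs_bulletin", "local_news"])
```

The design point is that corroborative sources are precise but sparse, while probabilistic sources are noisy but timely; combining them trades off false alarms against detection latency.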
Robust, Extensible, and Fast: Teamed Classifiers for Vehicle Tracking and Vehicle Re-ID in Multi-Camera Networks
As camera networks have become more ubiquitous over the past decade, the
research interest in video management has shifted to analytics on multi-camera
networks. This includes performing tasks such as object detection, attribute
identification, and vehicle/person tracking across different cameras without
overlap. Current frameworks for management are designed for multi-camera
networks in a closed dataset environment where there is limited variability in
cameras and characteristics of the surveillance environment are well known.
Furthermore, current frameworks are designed for offline analytics with
guidance from human operators for forensic applications. This paper presents a
teamed classifier framework for video analytics in heterogeneous many-camera
networks with adversarial conditions such as multi-scale, multi-resolution
cameras capturing the environment with varying occlusion, blur, and
orientations. We describe an implementation for vehicle tracking and vehicle
re-identification (re-id), where we implement a zero-shot learning (ZSL) system
that performs automated tracking of all vehicles all the time. Our evaluations
on VeRi-776 and Cars196 show the teamed classifier framework is robust to
adversarial conditions, extensible to changing video characteristics such as
new vehicle types/brands and new cameras, and offers real-time performance
compared to current offline video analytics approaches.
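The teamed-classifier idea can be caricatured in a few lines: each observed camera condition is routed to its own specialist model, and an unseen condition spawns a new team member rather than forcing one monolithic model to cover everything. The class and the label-function stand-ins for real models are hypothetical, not the paper's implementation.

```python
# Sketch of teamed classifiers: one specialist per camera condition,
# with new specialists spawned for unseen conditions. The lambda is a
# stand-in for a trained model; names are illustrative.

class TeamedClassifier:
    def __init__(self):
        self.team = {}  # condition -> specialist model (here: a label fn)

    def route(self, condition):
        """Return the specialist for this condition, spawning one if new."""
        if condition not in self.team:
            # Stand-in for training a specialist on the unseen condition.
            self.team[condition] = lambda x, c=condition: f"{c}:{x}"
        return self.team[condition]

    def predict(self, condition, frame):
        return self.route(condition)(frame)

tc = TeamedClassifier()
out1 = tc.predict("night_lowres", "frame_001")
out2 = tc.predict("day_occluded", "frame_002")
team_size = len(tc.team)
```

Routing by condition is what makes the framework extensible to new cameras and vehicle types without retraining the whole system.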
Challenges and Opportunities in Rapid Epidemic Information Propagation with Live Knowledge Aggregation from Social Media
A rapidly evolving situation such as the COVID-19 pandemic is a significant
challenge for AI/ML models because of its unpredictability. The most reliable
indicator of the pandemic's spread has been the number of test-positive cases.
However, the tests are both incomplete (due to untested asymptomatic cases) and
late (due to the lag from the initial contact event, worsening symptoms, and test
results). Social media can complement physical test data due to faster and
higher coverage, but they present a different challenge: significant amounts of
noise, misinformation and disinformation. We believe that social media can
become good indicators of the pandemic, provided two conditions are met. The first
(True Novelty) is the capture of new, previously unknown, information from
unpredictably evolving situations. The second (Fact vs. Fiction) is the
distinction of verifiable facts from misinformation and disinformation. Social
media information that satisfies these two conditions is called live knowledge.
We apply the evidence-based knowledge acquisition (EBKA) approach to collect,
filter, and update live knowledge through the integration of social media
sources with authoritative sources. Although limited in quantity, the reliable
training data from authoritative sources enable the filtering of misinformation
as well as capturing truly new information. We describe the EDNA/LITMUS tools
that implement EBKA, integrating social media such as Twitter and Facebook with
authoritative sources such as WHO and CDC, creating and updating live knowledge
on the COVID-19 pandemic.
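The EBKA filtering step can be sketched as a triage over a noisy social stream anchored by a small set of authoritative facts: posts matching authoritative knowledge pass as facts, known falsehoods are filtered, and everything else is treated as candidate live knowledge for further review. The exact-match check below is a toy stand-in for the real EDNA/LITMUS logic, and the seed sets are hypothetical.

```python
# Toy sketch of EBKA-style triage: a small authoritative seed set (e.g.
# from WHO/CDC) anchors filtering of a noisy social stream. Exact string
# matching stands in for the real EDNA/LITMUS pipeline.

AUTHORITATIVE = {"masks reduce transmission", "vaccines are effective"}
KNOWN_FALSE = {"5g causes covid"}

def triage(post):
    """Label a post 'fact', 'misinformation', or 'novel' (needs review)."""
    claim = post.lower()
    if claim in AUTHORITATIVE:
        return "fact"
    if claim in KNOWN_FALSE:
        return "misinformation"
    return "novel"  # previously unknown -> candidate live knowledge

labels = [triage(p) for p in [
    "Masks reduce transmission",
    "5G causes covid",
    "New variant detected in region X",
]]
```

The "novel" bucket is what makes the approach suited to unpredictably evolving situations: truly new information is surfaced rather than discarded along with the noise.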