Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on
batch-oriented inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators:
selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis-Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
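The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of query-biased Metropolis-Hastings sampling over clusterings, not the paper's algorithm. The token-overlap `similarity`, the pairwise `log_score`, the 0.5 offset, and the weighting scheme are all assumptions made for illustration. Mentions that resemble the query are proposed for reassignment more often, and because those weights do not depend on the current clustering, the proposal stays symmetric and the acceptance test reduces to a score ratio.

```python
import math
import random


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase tokens (hypothetical stand-in for a pairwise factor)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def log_score(mentions, labels):
    """Unnormalized log-probability: sum of pairwise affinities inside each cluster."""
    total = 0.0
    for i in range(len(mentions)):
        for j in range(i + 1, len(mentions)):
            if labels[i] == labels[j]:
                # Offset by 0.5 so dissimilar pairs are penalized for sharing a cluster.
                total += similarity(mentions[i], mentions[j]) - 0.5
    return total


def query_driven_mh(mentions, query, steps=2000, seed=0):
    """Sample a clustering, spending most proposals on mentions that resemble the query."""
    rng = random.Random(seed)
    n = len(mentions)
    labels = list(range(n))  # start with every mention in its own cluster
    # Fixed, state-independent weights keep the proposal distribution symmetric.
    weights = [0.1 + similarity(m, query) for m in mentions]
    current = log_score(mentions, labels)
    for _ in range(steps):
        i = rng.choices(range(n), weights=weights)[0]
        old_label = labels[i]
        labels[i] = rng.randrange(n)  # uniform over the n fixed cluster ids
        proposed = log_score(mentions, labels)
        # Symmetric proposal: accept with probability min(1, exp(proposed - current)).
        if proposed >= current or rng.random() < math.exp(proposed - current):
            current = proposed
        else:
            labels[i] = old_label  # reject and restore the previous assignment
    return labels
```

A call such as `query_driven_mh(["John Smith", "J. Smith", "Jane Doe"], query="john smith")` returns one cluster id per mention. Recomputing the full score on every step is quadratic and is kept here only for readability.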
Using Delay Tolerant Networks as a Backbone for Low-cost Smart Cities
Rapid urbanization burdens city infrastructure and creates the need for local
governments to maximize the usage of resources to serve their citizens. Smart
city projects aim to alleviate the urbanization problem by deploying a vast
amount of Internet-of-things (IoT) devices to monitor and manage environmental
conditions and infrastructure. However, smart city projects can be extremely
expensive to deploy and manage. A significant portion of the expense is a
result of providing Internet connectivity via 5G or WiFi to IoT devices. This
paper proposes the use of delay tolerant networks (DTNs) as a backbone for
smart city communication, enabling developing communities to become smart
cities at a fraction of the cost. A model is introduced to aid policy makers in
designing and evaluating the expected performance of such networks. Preliminary
results are presented based on a public transit network dataset from Chapel
Hill, North Carolina. Finally, innovative ways of improving network performance
in a low-cost smart city are discussed.
Comment: 3 pages, accepted to IEEE SmartComp 201
Towards Fair Disentangled Online Learning for Changing Environments
In the problem of online learning for changing environments, data are
sequentially received one after another over time, and their distribution
assumptions may vary frequently. Although existing methods demonstrate the
effectiveness of their learning algorithms by providing a tight bound on either
dynamic regret or adaptive regret, most of them completely ignore learning with
model fairness, defined as the statistical parity across different
sub-populations (e.g., race and gender). Another drawback is that when adapting
to a new environment, an online learner needs to update model parameters with a
global change, which is costly and inefficient. Inspired by the sparse
mechanism shift hypothesis, we claim that changing environments in online
learning can be attributed to partial changes in learned parameters that are
specific to environments, while the rest remain invariant to changing
environments. To this end, in this paper, we propose a novel algorithm under
the assumption that data collected at each time can be disentangled into two
representations, an environment-invariant semantic factor and an
environment-specific variation factor. The semantic factor is further used for
fair prediction under a group fairness constraint. To evaluate the sequence of
model parameters generated by the learner, a novel regret is proposed that
takes a mixed form of dynamic and static regret metrics followed by a
fairness-aware long-term constraint. The detailed analysis provides theoretical
guarantees for loss regret and violation of cumulative fairness constraints.
Empirical evaluations on real-world datasets demonstrate that our proposed method
sequentially outperforms baseline methods in model accuracy and fairness.
Comment: Accepted by KDD 202
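The statistical parity notion mentioned in the abstract can be made concrete with a small check. This is a minimal sketch assuming binary predictions and exactly two groups; the function name and the absolute-difference form are illustrative choices, not the paper's fairness constraint.

```python
def statistical_parity_gap(predictions, groups):
    """Absolute difference in positive-prediction rate between two groups.

    predictions: iterable of 0/1 model outputs.
    groups: iterable of group identifiers, aligned with predictions.
    """
    by_group = {}
    for pred, group in zip(predictions, groups):
        by_group.setdefault(group, []).append(pred)
    if len(by_group) != 2:
        raise ValueError("this sketch assumes exactly two groups")
    # Positive rate per group: fraction of members predicted as 1.
    rate_a, rate_b = (sum(preds) / len(preds) for preds in by_group.values())
    return abs(rate_a - rate_b)
```

A gap of 0 means both groups receive positive predictions at the same rate; an online learner could track this quantity per round to monitor cumulative fairness violation.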
M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval
In recent research, contrastive learning has proven to be a highly effective
method for representation learning and is widely used for dense retrieval.
However, we identify that relying solely on contrastive learning can lead to
suboptimal retrieval performance. On the other hand, despite many retrieval
datasets supporting various learning objectives beyond contrastive learning,
combining them efficiently in multi-task learning scenarios can be challenging.
In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence
retrieval system built upon a novel Multi-task Mixed-objective approach for
dense text representation learning, addressing the aforementioned challenges.
Our approach yields state-of-the-art performance on a large-scale open-domain
fact verification benchmark dataset, FEVER. Code and data are available at:
https://github.com/TonyBY/M3
Comment: Accepted by LREC-COLING 202
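For readers unfamiliar with the contrastive objective commonly used in dense retrieval, here is a minimal sketch of an InfoNCE-style loss in plain Python. The abstract does not state which contrastive loss M3 uses, so the dot-product scoring, the `temperature` default, and the function name are assumptions for illustration only.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=0.05):
    """InfoNCE-style loss: negative log-softmax of the positive passage's score.

    query, positive: embedding vectors (lists of floats).
    negatives: list of embedding vectors for non-matching passages.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive goes first so its logit sits at index 0.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_denominator - logits[0]
```

The loss shrinks toward zero as the positive's score dominates the negatives. A mixed-objective setup would add further loss terms to this one, but how M3 combines them is not described in the abstract.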
Social media captures demographic and regional physical activity
Objectives: We examined the use of data from social media for surveillance of physical activity prevalence in the USA.
Methods: We obtained data from the social media site Twitter from April 2015 to March 2016. The data consisted of 1 382 284 geotagged physical activity tweets from 481 146 users (55.7% men and 44.3% women) in more than 2900 counties. We applied machine learning and statistical modelling to demonstrate sex and regional variations in preferred exercises, and assessed the association between reports of physical activity on Twitter and population-level inactivity prevalence from the US Centers for Disease Control and Prevention.
Results: The association between physical inactivity tweet patterns and physical activity prevalence varied by sex and region. Walking was the most popular physical activity for both men and women across all regions (15.94% (95% CI 15.85% to 16.02%) and 18.74% (95% CI 18.64% to 18.88%) of tweets, respectively). Men and women mentioned performing gym-based activities at approximately the same rates (4.68% (95% CI 4.63% to 4.72%) and 4.13% (95% CI 4.08% to 4.18%) of tweets, respectively). CrossFit was most popular among men (14.91% (95% CI 14.52% to 15.31%)) among gym-based tweets, whereas yoga was most popular among women (26.66% (95% CI 26.03% to 27.19%)). Men mentioned engaging in higher intensity activities than women. Overall, counties with higher physical activity tweets also had lower leisure-time physical inactivity prevalence for both sexes.
Conclusions: The regional-specific and sex-specific activity patterns captured on Twitter may allow public health officials to identify changes in health behaviours at small geographical scales and to design interventions best suited for specific populations.