17 research outputs found

    Query-Driven Sampling for Collective Entity Resolution

    Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeling. Many of these efforts are based on batch-oriented inference, which inhibits a real-time workflow. One important task is entity resolution (ER). ER is the process of determining which records (mentions) in a database correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that such an exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators: selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis-Hastings algorithm to generate biased samples, and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with the extraction from a newswire dataset containing 71 million mentions.
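For readers unfamiliar with the Metropolis-Hastings sampler that the abstract builds on, the following is a minimal generic sketch, not the paper's query-driven variant. The toy target distribution, the uniform proposal, and the name `metropolis_hastings` are all assumptions made for illustration.

```python
import random


def metropolis_hastings(weights, steps, seed=0):
    """Sample states 0..n-1 in proportion to unnormalized `weights`.

    Uses a uniform (symmetric) proposal, so the acceptance ratio
    reduces to weights[proposal] / weights[current].
    Returns the empirical visit frequency of each state.
    """
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(weights)
    for _ in range(steps):
        proposal = rng.randrange(len(weights))
        accept_prob = min(1.0, weights[proposal] / weights[state])
        if rng.random() < accept_prob:
            state = proposal
        counts[state] += 1
    return [c / steps for c in counts]


# Hypothetical target: three states with unnormalized weights 1, 2, 7.
freqs = metropolis_hastings([1.0, 2.0, 7.0], steps=20000)
```

With enough steps the visit frequencies approach the normalized weights (0.1, 0.2, 0.7 here). A query-driven sampler, as described in the abstract, would bias which proposals are made rather than proposing uniformly.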

    Using Delay Tolerant Networks as a Backbone for Low-cost Smart Cities

    Rapid urbanization burdens city infrastructure and creates the need for local governments to maximize the usage of resources to serve their citizens. Smart city projects aim to alleviate the urbanization problem by deploying a vast number of Internet-of-Things (IoT) devices to monitor and manage environmental conditions and infrastructure. However, smart city projects can be extremely expensive to deploy and manage. A significant portion of the expense is a result of providing Internet connectivity via 5G or WiFi to IoT devices. This paper proposes the use of delay tolerant networks (DTNs) as a backbone for smart city communication, enabling developing communities to become smart cities at a fraction of the cost. A model is introduced to aid policy makers in designing and evaluating the expected performance of such networks. Preliminary results are presented based on a public transit network dataset from Chapel Hill, North Carolina. Finally, innovative ways of improving network performance in a low-cost smart city are discussed. Comment: 3 pages, accepted to IEEE SmartComp 201

    Towards Fair Disentangled Online Learning for Changing Environments

    In the problem of online learning for changing environments, data are sequentially received one after another over time, and their distribution assumptions may vary frequently. Although existing methods demonstrate the effectiveness of their learning algorithms by providing a tight bound on either dynamic regret or adaptive regret, most of them completely ignore learning with model fairness, defined as the statistical parity across different subpopulations (e.g., race and gender). Another drawback is that when adapting to a new environment, an online learner needs to update model parameters with a global change, which is costly and inefficient. Inspired by the sparse mechanism shift hypothesis, we claim that changing environments in online learning can be attributed to partial changes in learned parameters that are specific to environments, while the rest remain invariant to changing environments. To this end, in this paper, we propose a novel algorithm under the assumption that data collected at each time can be disentangled into two representations, an environment-invariant semantic factor and an environment-specific variation factor. The semantic factor is further used for fair prediction under a group fairness constraint. To evaluate the sequence of model parameters generated by the learner, a novel regret is proposed that takes a mixed form of dynamic and static regret metrics followed by a fairness-aware long-term constraint. The detailed analysis provides theoretical guarantees for loss regret and violation of cumulative fairness constraints. Empirical evaluations on real-world datasets demonstrate that our proposed method sequentially outperforms baseline methods in model accuracy and fairness. Comment: Accepted by KDD 202
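The abstract defines fairness as statistical parity across subpopulations. As a rough illustration of that notion only (not the paper's constraint or algorithm), the sketch below measures the gap in positive-prediction rates between groups. The function name `statistical_parity_gap` and the toy data are hypothetical.

```python
def statistical_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rate between any two groups.

    `predictions` holds binary outputs (0 or 1); `groups` holds the
    group label of each example. A gap of 0 means perfect parity.
    """
    rates = {}
    for g in set(groups):
        member_preds = [p for p, grp in zip(predictions, groups) if grp == g]
        rates[g] = sum(member_preds) / len(member_preds)
    values = list(rates.values())
    return max(values) - min(values)


# Hypothetical toy data: group "a" receives 3/4 positives, group "b" 1/4.
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap = statistical_parity_gap(preds, groups)
```

In this toy case the gap is 0.75 - 0.25 = 0.5. A fairness-constrained learner would aim to keep such a quantity small, for example cumulatively over time as the abstract's long-term constraint suggests.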

    M3: A Multi-Task Mixed-Objective Learning Framework for Open-Domain Multi-Hop Dense Sentence Retrieval

    In recent research, contrastive learning has proven to be a highly effective method for representation learning and is widely used for dense retrieval. However, we identify that relying solely on contrastive learning can lead to suboptimal retrieval performance. On the other hand, despite many retrieval datasets supporting various learning objectives beyond contrastive learning, combining them efficiently in multi-task learning scenarios can be challenging. In this paper, we introduce M3, an advanced recursive Multi-hop dense sentence retrieval system built upon a novel Multi-task Mixed-objective approach for dense text representation learning, addressing the aforementioned challenges. Our approach yields state-of-the-art performance on a large-scale open-domain fact verification benchmark dataset, FEVER. Code and data are available at: https://github.com/TonyBY/M3 Comment: Accepted by LREC-COLING 202
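To make the contrastive objective mentioned in the abstract concrete, here is a minimal sketch of an InfoNCE-style loss for one query. It is a common formulation used in dense retrieval, assumed here for illustration; it is not M3's mixed objective, and the name `info_nce_loss`, the dot-product scoring, and the temperature value are assumptions.

```python
import math


def info_nce_loss(query, positive, negatives, temperature=0.1):
    """Contrastive loss for one query: -log softmax score of the positive.

    Vectors are plain lists of floats; similarity is the dot product.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    # The positive passage's logit comes first, followed by the negatives.
    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]

    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[0]


# Hypothetical 2-d embeddings: the loss is small when the positive
# aligns with the query and large when a negative aligns instead.
good = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
bad = info_nce_loss([1.0, 0.0], [-1.0, 0.0], [[0.0, 1.0], [1.0, 0.0]])
```

A multi-task setup of the kind the abstract describes would combine a loss like this with other objectives; how they are weighted or scheduled is not specified here.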

    Social media captures demographic and regional physical activity

    Objectives: We examined the use of data from social media for surveillance of physical activity prevalence in the USA. Methods: We obtained data from the social media site Twitter from April 2015 to March 2016. The data consisted of 1 382 284 geotagged physical activity tweets from 481 146 users (55.7% men and 44.3% women) in more than 2900 counties. We applied machine learning and statistical modelling to demonstrate sex and regional variations in preferred exercises, and assessed the association between reports of physical activity on Twitter and population-level inactivity prevalence from the US Centers for Disease Control and Prevention. Results: The association between physical inactivity tweet patterns and physical activity prevalence varied by sex and region. Walking was the most popular physical activity for both men and women across all regions (15.94% (95% CI 15.85% to 16.02%) and 18.74% (95% CI 18.64% to 18.88%) of tweets, respectively). Men and women mentioned performing gym-based activities at approximately the same rates (4.68% (95% CI 4.63% to 4.72%) and 4.13% (95% CI 4.08% to 4.18%) of tweets, respectively). CrossFit was most popular among men (14.91% (95% CI 14.52% to 15.31%)) among gym-based tweets, whereas yoga was most popular among women (26.66% (95% CI 26.03% to 27.19%)). Men mentioned engaging in higher intensity activities than women. Overall, counties with higher physical activity tweets also had lower leisure-time physical inactivity prevalence for both sexes. Conclusions: The region-specific and sex-specific activity patterns captured on Twitter may allow public health officials to identify changes in health behaviours at small geographical scales and to design interventions best suited for specific populations.