
    Extracting News Events from Microblogs

    The Twitter stream has become a large source of information for many people, but the sheer volume of tweets and the noisy nature of their content have long made harvesting knowledge from Twitter a challenging task for researchers. Aiming to overcome some of the main challenges of extracting hidden information from tweet streams, this work proposes a new approach for real-time detection of news events from the Twitter stream. We divide our approach into three steps. The first step is to use a neural-network (deep-learning) classifier to detect news-relevant tweets in the stream. The second step is to apply a novel streaming data clustering algorithm to the detected news tweets to form news events. The third and final step is to rank the detected events based on the size of the event clusters and the growth rate of their tweet frequencies. We evaluate the proposed system on a large, publicly available corpus of annotated news events from Twitter. As part of the evaluation, we compare our approach with a related state-of-the-art solution. Overall, our experiments and user-based evaluation show that our approach to detecting current (real) news events delivers state-of-the-art performance.
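    To make the three-step pipeline concrete, here is a minimal sketch in Python. All names are hypothetical: the abstract does not specify the paper's classifier, clustering algorithm, or exact ranking formula, so the classifier and cluster-assignment function are taken as parameters and the score is one plausible combination of cluster size and recent tweet-frequency growth.

```python
import math

# Hypothetical sketch of the three-step pipeline described in the abstract.
# The actual classifier, clustering algorithm, and scoring weights used in
# the paper are not specified here.

class EventCluster:
    def __init__(self):
        self.tweets = []      # tweets assigned to this event
        self.timestamps = []  # arrival times (seconds), for growth estimation

    def add(self, tweet, ts):
        self.tweets.append(tweet)
        self.timestamps.append(ts)

    def growth_rate(self, window=3600.0):
        """Tweets per hour within the most recent time window."""
        now = self.timestamps[-1]
        recent = [t for t in self.timestamps if now - t <= window]
        return len(recent) / (window / 3600.0)

    def score(self):
        """Step 3: rank by cluster size combined with tweet-frequency growth."""
        return len(self.tweets) * math.log1p(self.growth_rate())

def process_stream(stream, classifier, assign_cluster):
    """stream yields (tweet_text, timestamp) pairs."""
    clusters = {}
    for tweet, ts in stream:
        # Step 1: a trained classifier (e.g., a neural network) filters
        # news-relevant tweets out of the noisy stream.
        if classifier.predict(tweet) != "news":
            continue
        # Step 2: a streaming clustering algorithm groups news tweets
        # into candidate events (assignment function supplied by caller).
        cid = assign_cluster(tweet, clusters)
        clusters.setdefault(cid, EventCluster()).add(tweet, ts)
    # Step 3: return candidate events ranked by score, best first.
    return sorted(clusters.values(), key=EventCluster.score, reverse=True)
```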

    Stream Aggregation Through Order Sampling

    This paper introduces a new single-pass reservoir weighted-sampling stream aggregation algorithm, Priority-Based Aggregation (PBA). While order sampling is a powerful and efficient method for weighted sampling from a stream of uniquely keyed items, there is no current algorithm that realizes the benefits of order sampling in the context of stream aggregation over non-unique keys. A naive approach that order-samples regardless of key and then aggregates the results is hopelessly inefficient. In contrast, our proposed algorithm uses a single persistent random variable across the lifetime of each key in the cache, and maintains unbiased estimates of the key aggregates that can be queried at any point in the stream. The basic approach can be supplemented with a Sample and Hold pre-sampling stage whose sampling rate adaptation is controlled by PBA. This represents a considerable reduction in computational complexity compared with the state of the art in adapting Sample and Hold to operate with a fixed cache size. Concerning statistical properties, we prove that PBA provides unbiased estimates of the true aggregates. We analyze the computational complexity of PBA and its variants, and provide a detailed evaluation of its accuracy on synthetic and trace data. Weighted relative error is reduced by 40% to 65% at sampling rates of 5% to 17%, relative to Adaptive Sample and Hold; there is also substantial improvement for rank queries.
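    The core idea, a fixed-size cache where each key carries one persistent uniform random variable that determines its order-sampling priority, can be sketched as follows. This is a simplified illustration under assumptions, not the paper's algorithm: PBA's exact estimator corrections and threshold accounting are more involved than the `max(weight, threshold * u)` credit shown here.

```python
import random

# Simplified sketch of a priority-based aggregation cache in the spirit of
# PBA. Not the paper's exact algorithm: the unbiasedness corrections on key
# re-entry are more involved than the simplification below.

class PriorityAggregator:
    def __init__(self, capacity):
        self.capacity = capacity
        self.u = {}            # one persistent uniform random variable per key
        self.agg = {}          # running aggregate estimate per cached key
        self.threshold = 0.0   # largest priority evicted so far

    def _priority(self, key):
        # Order-sampling priority: aggregate weight divided by the key's
        # single persistent random number (higher priority = more likely kept).
        return self.agg[key] / self.u[key]

    def update(self, key, weight):
        if key in self.agg:
            # Cached key: simply accumulate its weight.
            self.agg[key] += weight
            return
        # New (or re-entering) key: draw its persistent random variable once.
        self.u[key] = random.random()
        # Crude entry credit for unbiasedness (the real PBA update differs).
        self.agg[key] = max(weight, self.threshold * self.u[key])
        if len(self.agg) > self.capacity:
            # Fixed cache size: evict the lowest-priority key and raise
            # the threshold so later estimates stay unbiased.
            victim = min(self.agg, key=self._priority)
            self.threshold = max(self.threshold, self._priority(victim))
            del self.agg[victim], self.u[victim]

    def estimate(self, key):
        """Estimate of the key's true aggregate (0 if not currently cached)."""
        return self.agg.get(key, 0.0)
```

    Because each key's random variable is drawn once and reused for its whole lifetime in the cache, the priority of a key rises monotonically as its aggregate grows, which is what lets the estimates be queried at any point in the stream.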

    Wrangling environmental exposure data: guidance for getting the best information from your laboratory measurements.

    BACKGROUND: Environmental health and exposure researchers can improve the quality and interpretation of their chemical measurement data, avoid spurious results, and improve analytical protocols for new chemicals by closely examining laboratory and field quality control (QC) data. Reporting QC data along with chemical measurements in biological and environmental samples allows readers to evaluate data quality and the appropriate uses of the data (e.g., for comparison to other exposure studies, association with health outcomes, or use in regulatory decision-making). However, many studies do not adequately describe or interpret QC assessments in publications, leaving readers uncertain about the level of confidence in the reported data. One potential barrier to both QC implementation and reporting is that guidance on how to integrate and interpret QC assessments is often fragmented and difficult to find, with no centralized repository or summary. In addition, existing documents are typically written for regulatory scientists rather than environmental health researchers, who may have little or no experience in analytical chemistry.
    OBJECTIVES: We discuss approaches for implementing quality assurance/quality control (QA/QC) in environmental exposure measurement projects and describe our process for interpreting QC results and drawing conclusions about data validity.
    DISCUSSION: Our methods build upon existing guidance and years of practical experience collecting exposure data and analyzing it in collaboration with contract and university laboratories, as well as the Centers for Disease Control and Prevention. With real examples from our data, we demonstrate problems that would not have come to light had we not engaged with our QC data and incorporated field QC samples in our study design. Our approach focuses on descriptive analyses and data visualizations that have been compatible with diverse exposure studies, with sample sizes ranging from tens to hundreds of samples. Future work could incorporate additional statistically grounded methods for larger datasets with more QC samples.
    CONCLUSIONS: This guidance, along with example table shells, graphics, and some sample R code, provides a useful set of tools for getting the best information from valuable environmental exposure datasets and enabling valid comparison and synthesis of exposure data across studies.
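    As an illustration of the kind of descriptive QC analysis the abstract describes, here is a short Python sketch (not the authors' code, which the paper supplies in R) of two common checks: deriving a reporting limit from field blanks and computing the relative percent difference between duplicate samples. The column names, the mean-plus-three-standard-deviations convention, and the toy data are all assumptions for illustration.

```python
import pandas as pd

# Illustrative QC checks on hypothetical exposure data (column names and
# values invented for this sketch; conventions vary by study).
df = pd.DataFrame({
    "sample_id":  ["s1", "s1", "s2", "b1", "b2", "b3"],
    "type":       ["field", "field_dup", "field", "blank", "blank", "blank"],
    "conc_ng_ml": [5.2, 4.8, 0.9, 0.30, 0.25, 0.35],
})

# Reporting limit from field blanks: mean + 3 standard deviations is one
# common convention for distinguishing results from background contamination.
blanks = df.loc[df["type"] == "blank", "conc_ng_ml"]
reporting_limit = blanks.mean() + 3 * blanks.std()

# Flag measurements that are not confidently above the blank level.
df["below_rl"] = df["conc_ng_ml"] < reporting_limit

# Relative percent difference (RPD) between a sample and its field duplicate,
# a standard descriptive measure of measurement precision.
def rpd(a, b):
    return abs(a - b) / ((a + b) / 2) * 100

pair = df.loc[df["sample_id"] == "s1", "conc_ng_ml"].values
print(f"reporting limit: {reporting_limit:.2f} ng/mL")
print(f"duplicate RPD for s1: {rpd(pair[0], pair[1]):.1f}%")
```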