11,185 research outputs found
Local Differentially Private Heavy Hitter Detection in Data Streams with Bounded Memory
Top-k frequent item detection is a fundamental task in data stream mining.
Many promising solutions have been proposed to improve memory efficiency
while still maintaining high accuracy in detecting the Top-k items. Beyond
the memory-efficiency concern, users may suffer privacy loss when
participating in the task without proper protection, since their
contributed local data streams can continually leak sensitive individual
information. However, most existing works focus on either the
memory-efficiency problem or the privacy concerns, but seldom both jointly,
and therefore cannot achieve a satisfactory tradeoff between memory
efficiency, privacy protection, and detection accuracy.
In this paper, we present HG-LDP, a novel framework that achieves accurate
Top-k item detection at bounded memory expense while providing rigorous
local differential privacy (LDP) protection. Specifically, we identify two
key challenges naturally arising in the task, which reveal that directly
applying existing LDP techniques leads to an inferior
"accuracy-privacy-memory efficiency" tradeoff. We therefore instantiate
three advanced schemes under the framework by designing novel LDP
randomization methods, which address the hurdles caused by the large size
of the item domain and by the limited space of the memory. We conduct
comprehensive experiments on both synthetic and real-world datasets to show
that the proposed schemes achieve a superior "accuracy-privacy-memory
efficiency" tradeoff, saving memory over baseline methods for large item
domains. Our code is open-sourced.
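The abstract does not spell out HG-LDP's randomization methods, but the generic building block it refers to — an LDP frequency oracle over a finite item domain — can be sketched with k-ary randomized response (k-RR) plus unbiased debiasing. This is an illustrative stand-in, not HG-LDP's actual scheme, and all names below are hypothetical:

```python
import math
import random
from collections import Counter

def krr_perturb(item, domain, epsilon):
    """k-ary randomized response: report the true item with probability p,
    otherwise a uniformly random *other* item from the domain."""
    d = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    if random.random() < p:
        return item
    return random.choice([x for x in domain if x != item])

def estimate_frequencies(reports, domain, epsilon):
    """Unbiased frequency estimates from k-RR reports:
    est(x) = (count(x) - n*q) / (p - q), with q = (1-p)/(d-1)."""
    d = len(domain)
    n = len(reports)
    p = math.exp(epsilon) / (math.exp(epsilon) + d - 1)
    q = (1 - p) / (d - 1)
    counts = Counter(reports)
    return {x: (counts[x] - n * q) / (p - q) for x in domain}
```

A Top-k detector would rank the debiased estimates and keep the k largest; the bounded-memory aspect the paper addresses replaces the full per-item count table with a compact summary, which this sketch deliberately omits.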
Time-aware topic recommendation based on micro-blogs
Topic recommendation can help users deal with information overload in micro-blogging communities. This paper proposes to use the implicit information network formed by the multiple relationships among users, topics, and micro-blogs, together with the temporal information of micro-blogs, to find semantically and temporally relevant topics for each topic and to profile users' time-drifting topic interests. Content-based, Nearest-Neighbor-based, and Matrix Factorization models are used to make personalized recommendations. The effectiveness of the proposed approaches is demonstrated in experiments conducted on a real-world dataset collected from Twitter.com.
‘Where else is the money? A study of innovation in online business models at newspapers in Britain’s 66 cities’
Much like their counterparts in the United States and elsewhere, British newspaper publishers have seen a sharp decline in revenues from traditional sources—print advertising and copy sales—and many are intensifying efforts to generate new income by expanding their online offerings. A study of the largest-circulation newspapers in the 66 cities in England, Scotland, Wales and Northern Ireland showed that while only a small minority did not have companion websites, many of the publishers who do have an online presence have transferred familiar revenue models. It has also been recognised that income from these sources is not enough to sustain current operations, and innovative publishers have diversified into additional broad categories of Web business models. Significantly, this study not only compared the approaches of various news publishers with each other, but also considered how active newspaper publishers were in taking advantage of the variety of business models generally being employed on the Web—and which opportunities were ignored.
CamFlow: Managed Data-sharing for Cloud Services
A model of cloud services is emerging whereby a few trusted providers manage
the underlying hardware and communications whereas many companies build on this
infrastructure to offer higher level, cloud-hosted PaaS services and/or SaaS
applications. From the start, strong isolation between cloud tenants was seen
to be of paramount importance, provided first by virtual machines (VM) and
later by containers, which share the operating system (OS) kernel. Increasingly
it is the case that applications also require facilities to effect isolation
and protection of data managed by those applications. They also require
flexible data sharing with other applications, often across the traditional
cloud-isolation boundaries; for example, when government provides many related
services for its citizens on a common platform. Similar considerations apply to
the end-users of applications. But in particular, the incorporation of cloud
services within 'Internet of Things' architectures is driving the requirements
for both protection and cross-application data sharing.
These concerns relate to the management of data. Traditional access control
is application and principal/role specific, applied at policy enforcement
points, after which there is no subsequent control over where data flows; a
crucial issue once data has left its owner's control, for example when
handled by cloud-hosted applications and passed between cloud services.
Information Flow Control (IFC), in
addition, offers system-wide, end-to-end, flow control based on the properties
of the data. We discuss the potential of cloud-deployed IFC for enforcing
owners' dataflow policy with regard to protection and sharing, as well as
safeguarding against malicious or buggy software. In addition, the audit log
associated with IFC provides transparency, giving configurable system-wide
visibility over data flows. [...] Comment: 14 pages, 8 figures
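As a toy illustration of the tag-based IFC model the abstract describes, the following sketch checks whether a flow between two entities is permitted under the classic safe-flow rule (destination must hold every secrecy tag of the source; source must hold every integrity tag of the destination). This is a minimal sketch of generic IFC, not CamFlow's implementation; the entity and tag names are invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entity:
    name: str
    secrecy: frozenset = frozenset()    # confidentiality tags carried by the entity
    integrity: frozenset = frozenset()  # integrity tags the entity is trusted with

def flow_allowed(src: Entity, dst: Entity) -> bool:
    """A flow src -> dst is safe iff the destination holds every secrecy tag
    of the source (no leakage) and the source holds every integrity tag of
    the destination (no contamination of trusted data)."""
    return src.secrecy <= dst.secrecy and dst.integrity <= src.integrity
```

Under this rule, data tagged as medical can flow only to entities that also carry the medical secrecy tag, giving the system-wide, end-to-end control the abstract contrasts with point-wise access control.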
DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams
Finding special items, like heavy hitters, top-k, and persistent items, has always been a hot issue in data stream processing for web analysis. While data streams nowadays are usually high-dimensional, most prior works focus on special items along a certain primary dimension and yield little insight into the correlations between dimensions. Therefore, we propose to find special quadratic elements to reveal close correlations. Based on the items mentioned above, we extend our problem to three applications related to heavy hitters, top-k, and persistent items, and design a generic framework, DUET, to process them. Besides, we analyze the error bound of our algorithm and conduct extensive experiments on four datasets. Our experimental results show that DUET can achieve 3.5 times higher throughput and three orders of magnitude lower average relative error compared with cutting-edge algorithms.
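DUET's sketch structure is not given in the abstract; as a point of reference, here is an exact (memory-unbounded) baseline for one of the problems it targets — heavy-hitter pairs, i.e. pairs of values from different dimensions whose joint frequency crosses a threshold. Function and variable names are illustrative only:

```python
from collections import Counter
from itertools import combinations

def heavy_pairs(records, threshold):
    """Exact baseline for 'special quadratic elements': count co-occurring
    value pairs within each multi-dimensional record and report those whose
    joint frequency reaches the threshold."""
    counts = Counter()
    for record in records:
        # Sort so ('x', 'y') and ('y', 'x') count as the same unordered pair.
        for a, b in combinations(sorted(record), 2):
            counts[(a, b)] += 1
    return {pair: c for pair, c in counts.items() if c >= threshold}
```

A sketch-based solution such as DUET would replace the exact Counter with a compact probabilistic summary to bound memory while approximating the same output.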
On Frequency Estimation and Detection of Heavy Hitters in Data Streams
A stream can be thought of as a very large, sometimes even infinite, set of data which arrives sequentially and must be processed without the possibility of being stored. Indeed, the memory available to the algorithm is limited, and it is not possible to store the whole stream, which is instead scanned upon arrival and summarized through a succinct data structure in order to maintain only the information of interest. Two of the main tasks related to data stream processing are frequency estimation and heavy hitter detection. The frequency estimation problem requires estimating the frequency of each item, that is, the number of times (or the total weight with which) each item appears in the stream, while heavy hitter detection means detecting all those items with a frequency higher than a fixed threshold. In this work we design and analyze ACMSS, an algorithm for frequency estimation and heavy hitter detection, and compare it against the state-of-the-art ASKETCH algorithm. We show that, given the same budgeted amount of memory, our algorithm outperforms ASKETCH in accuracy for the task of frequency estimation. Furthermore, we show that, under the assumptions stated by its authors, ASKETCH may not be able to report all of the heavy hitters, whilst ACMSS will, with high probability, provide the full list of heavy hitters.
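The ACMSS data structure itself is not reproduced here, but the classic Space-Saving algorithm — a standard bounded-memory heavy-hitter summary of the kind this line of work builds on — illustrates the task. This is a generic sketch, not ACMSS or ASKETCH:

```python
def space_saving(stream, k):
    """Space-Saving summary with at most k counters. When a new item arrives
    and all counters are taken, it replaces the item with the smallest count,
    inheriting that count plus one (an overestimate of its true frequency)."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            victim = min(counters, key=counters.get)
            counters[item] = counters.pop(victim) + 1
    return counters

def heavy_hitters(stream, k, threshold):
    """Report items whose estimated count reaches the threshold."""
    summary = space_saving(stream, k)
    return [x for x, c in summary.items() if c >= threshold]
```

With k counters, an item's estimated count exceeds its true count by at most n/k (for a stream of n items), so every item with true frequency above n/k is guaranteed to survive in the summary — which is why Space-Saving-style summaries can report the full heavy-hitter list.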