3,862 research outputs found

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence

Author: Thorup Mikkel
Publication date: 01/01/2013

We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting of the k elements that are smallest according to a given hash function h. With this sample we can estimate the relative size f=|Y|/|X| of any subset Y as |S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard similarity f=|A intersect B|/|A union B| between sets A and B. Given the bottom-k samples from A and B, we construct the bottom-k sample of their union as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k. We show here that even if the hash function is only 2-independent, the expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a constant factor of the expected relative error with truly random hashing. For comparison, consider the classic approach of k×min-wise where we use k independent hash functions h_1,...,h_k, storing the smallest element with each hash function. For k×min-wise there is an at least constant bias with constant independence, and it is not reduced with larger k. Recently Feigenblat et al. showed that bottom-k circumvents the bias if the hash function is 8-independent and k is sufficiently large. We get down to 2-independence for any k. Our result is based on a simple union bound, transferring generic concentration bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger probability error bounds with higher independence. For weighted sets, we consider priority sampling which adapts efficiently to the concrete input weights, e.g., benefiting strongly from heavy-tailed input. This time, the analysis is much more involved, but again we show that generic concentration bounds can be applied. Comment: A short version appeared at STOC'1
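The estimator described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not code from the paper: the function names (`make_hash`, `bottom_k`, `jaccard_estimate`), the choice of the linear hash (a*x + b) mod p as the 2-independent family, and the Mersenne prime 2^61 - 1 are all assumptions made for the example. Keys are assumed to be integers in [0, p), and each set is assumed to hold at least k elements.

```python
import random

PRIME = (1 << 61) - 1  # assumed modulus; keys must be integers in [0, PRIME)

def make_hash(seed):
    """Return h(x) = (a*x + b) mod p, a standard 2-independent family."""
    rng = random.Random(seed)
    a = rng.randrange(PRIME)
    b = rng.randrange(PRIME)
    return lambda x: (a * x + b) % PRIME

def bottom_k(items, h, k):
    """Bottom-k sample S_k(X): the k elements with the smallest hash values."""
    return set(sorted(set(items), key=h)[:k])

def jaccard_estimate(sample_a, sample_b, h, k):
    """Estimate |A intersect B| / |A union B| from the two bottom-k samples."""
    # S_k(A union B) = S_k(S_k(A) union S_k(B))
    union_sample = bottom_k(sample_a | sample_b, h, k)
    return len(union_sample & sample_a & sample_b) / k

h = make_hash(seed=1)
k = 200
A = set(range(0, 1000))
B = set(range(500, 1500))   # true Jaccard similarity is 500/1500
estimate = jaccard_estimate(bottom_k(A, h, k), bottom_k(B, h, k), h, k)
```

Only the two size-k samples are needed at estimation time, which is the practical appeal of the scheme: the full sets A and B never have to be stored together.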

arXiv.org e-Print Archive

CiteSeerX

Crossref

Copenhagen University Research Information System

Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence

Author: Thorup Mikkel
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2013

Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1} mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the character hash tables h_i are in cache. Simple tabulation hashing is not 4-independent, but we show that if we apply it twice, then we get high independence. First we hash to intermediate keys that are 6 times longer than the original keys, and then we hash the intermediate keys to the final hash values. The intermediate keys have d=6c characters from A. We can view the hash function as a degree d bipartite graph with keys on one side, each with edges to d output characters. We show that this graph has nice expansion properties, and from that we get that with another level of simple tabulation on the intermediate keys, the composition is a highly independent hash function. The independence we get is |A|^{Omega(1/c)}. Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel [FOCS'89, SICOMP'04] proved that with this space, if the hash function is evaluated in o(c) time, then the independence can only be o(c), so our evaluation time is best possible for Omega(c) independence---our independence is much higher if c=|A|^{o(1)}. Siegel used O(c)^c evaluation time to get the same independence with similar space. Siegel's main focus was c=O(1), but we are exponentially faster when c=omega(1). Applying our scheme recursively, we can increase our independence to |A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's scheme this is both faster and higher independence. Our scheme is easy to implement, and it does provide realistic implementations of 100-independent hashing for, say, 32 and 64-bit keys
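A minimal Python sketch of the two constructions named in this abstract may help fix ideas. It is an illustration under stated assumptions, not the paper's implementation: the helper names (`make_simple_tabulation`, `make_double_tabulation`), the packing of a key's characters into the bits of one integer, and the use of Python's `random` module to fill the tables are all hypothetical choices. The factor d = 6c for the intermediate key length is taken from the abstract.

```python
import random

def make_simple_tabulation(c, char_bits, out_bits, seed):
    """Build c random character tables and return the hash function.

    A key x is read as c characters of char_bits bits each, and is hashed
    to h_0[x_0] xor ... xor h_{c-1}[x_{c-1}].
    """
    rng = random.Random(seed)
    tables = [[rng.getrandbits(out_bits) for _ in range(1 << char_bits)]
              for _ in range(c)]
    mask = (1 << char_bits) - 1

    def h(x):
        value = 0
        for i in range(c):
            value ^= tables[i][(x >> (i * char_bits)) & mask]
        return value

    return h

def make_double_tabulation(c, char_bits, out_bits, seed):
    """Compose two simple tabulation functions via a 6c-character key."""
    d = 6 * c  # intermediate key length, as stated in the abstract
    first = make_simple_tabulation(c, char_bits, d * char_bits, seed)
    second = make_simple_tabulation(d, char_bits, out_bits, seed + 1)
    return lambda x: second(first(x))

simple = make_simple_tabulation(c=2, char_bits=8, out_bits=32, seed=7)
double = make_double_tabulation(c=4, char_bits=8, out_bits=32, seed=7)
```

The xor structure also shows why simple tabulation alone is not 4-independent: for four keys that form a "rectangle" of characters, such as 0x0101, 0x0102, 0x0201 and 0x0202 with c=2, every table entry appears twice, so the four hash values always xor to zero.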

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Fast and Powerful Hashing using Tabulation

Author: Thorup Mikkel
Publication date: 01/01/2016

Randomized algorithms are often enjoyed for their simplicity, but the hash functions employed to yield the desired probabilistic guarantees are often too complicated to be practical. Here we survey recent results on how simple hashing schemes based on tabulation provide unexpectedly strong guarantees. Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as consisting of c characters and we have precomputed character tables h_1,...,h_c mapping characters to random hash values. A key x=(x_1,...,x_c) is hashed to h_1[x_1] \oplus h_2[x_2] \oplus ... \oplus h_c[x_c]. This scheme is very fast with character tables in cache. While simple tabulation is not even 4-independent, it does provide many of the guarantees that are normally obtained via higher independence, e.g., linear probing and Cuckoo hashing. Next we consider twisted tabulation where one input character is "twisted" in a simple way. The resulting hash function has powerful distributional properties: Chernoff-Hoeffding type tail bounds and a very small bias for min-wise hashing. This also yields an extremely fast pseudo-random number generator that is provably good for many classic randomized algorithms and data-structures. Finally, we consider double tabulation where we compose two simple tabulation functions, applying one to the output of the other, and show that this yields very high independence in the classic framework of Carter and Wegman [1977]. In fact, w.h.p., for a given set of size proportional to that of the space consumed, double tabulation gives fully-random hashing. We also mention some more elaborate tabulation schemes getting near-optimal independence for given time and space. While these tabulation schemes are all easy to implement and use, their analysis is not

arXiv.org e-Print Archive

Copenhagen University Research Information System

Dagstuhl Research Online Publication Server

Comparative Considerations: Lagerlöf, Andersen - and the British perspective

Author: Thomsen Bjarne Thorup
Publication date: 01/01/2011

Edinburgh Research Explorer

Combining agronomic and breeding approaches for improved nutrient use efficiency

Author: Thorup-Kristensen Kristian
Publication date: 01/09/2013

There is a strong need to improve agricultural nutrient use efficiency (NUE), but NUE is complex, and not even well defined. The abstract and presentation deal with how NUE is determined by the combination of Genetic, Environmental and Management factors (GxExM), and how genetics as well as crop management must be combined in order to achieve improved overall NUE

Organic Eprints

An organic vegetable crop rotation aimed at self-sufficiency in nitrogen

Author: Thorup-Kristensen Kristian
Publication venue: DARCOF
Publication date: 01/01/1999

The paper describes an organic vegetable crop rotation: the ideas behind its design, the use of green manures and catch crops, and how information on crop root growth has been used to try to design a crop rotation with a high NUE and minimal N leaching losses. Results from the first years of the rotation are presented, in terms of crop yield and N uptake and the content of inorganic N in the soil

Organic Eprints

Brassicas in sustainable production and organic farming

Author: Thorup-Kristensen Kristian
Publication date: 01/09/2008

Brassica plant species show some characteristics in their use of plant nutrients which make them different from most other crops. These characteristics often make brassicas difficult to grow in low-input systems with limited nutrient availability, but at the same time they also make some brassica species valuable tools for reducing nitrate leaching losses and improving N management in farming systems. The paper presents experimental results on brassica crops as main crops and cover crops

Organic Eprints

Utilising differences in rooting depth to design vegetable crop rotations with high nitrogen use efficiency (NUE)

Author: Thorup-Kristensen K
Publication venue: 'International Society for Horticultural Science (ISHS)'
Publication date: 01/01/2002

A number of methods involving plant or soil analysis or modelling have been developed to optimise N fertilization of vegetable crops. The methods aim at improving the NUE of each single crop, but do not really consider the crop rotation as such. Various measures can be used to increase the NUE of the crop rotation; measures that can be combined with the methods aimed at optimising NUE of each single crop. The aims of the paper are to discuss the methods for optimising NUE at the crop rotation level and to present examples of how this can be done. The main questions discussed are 1) how crops with different rooting depths can be optimally placed in a cropping sequence and 2) how catch crops can be introduced to optimise NUE. Results show that if N left in the soil after harvest of one crop is retained in the soil until spring, it will normally be found in deeper soil layers. Therefore rooting depth of the vegetable crops is important. It is illustrated that placing deep-rooted crops in the crop rotation preferentially where much N was left in the soil in the previous year can strongly increase the utilisation of the N residues. It is also shown how catch crops can be used to maintain a high NUE, especially in situations where the farmers choose to grow shallow-rooted vegetables even though much N may be available in deeper soil layers

Organic Eprints

Effect of crop management practices on the sustainability and environmental impact of organic and low input food production systems

Author: Thorup-Kristensen K.
Publication date: 01/01/2007

While organic farming can reduce many of the environmental problems caused by agriculture, it also includes some practices which are questionable in terms of environmental effects. Organic farming practices (rotations, fertilisation regimes, cover crop use) can differ significantly, and this leads to large differences in their environmental effects. This leaves considerable scope to improve the environmental effects of organic farming. The environmental aspects of organic farming are discussed, and model simulations are used to illustrate how even moderate changes in organic rotations can have large effects on sustainability, here measured by a simple index of nitrogen lost by leaching relative to nitrogen harvested by the crops. In WP3.3.4 we are working to improve model simulation of organic rotations, and in WP7.1 we are making environmental assessments of organic cropping practices tested in the QLIF project, using model simulations and other approaches

Organic Eprints