Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence
We consider bottom-k sampling for a set X, picking a sample S_k(X) consisting
of the k elements that are smallest according to a given hash function h. With
this sample we can estimate the relative size f=|Y|/|X| of any subset Y as
|S_k(X) intersect Y|/k. A standard application is the estimation of the Jaccard
similarity f=|A intersect B|/|A union B| between sets A and B. Given the
bottom-k samples from A and B, we construct the bottom-k sample of their union
as S_k(A union B)=S_k(S_k(A) union S_k(B)), and then the similarity is
estimated as |S_k(A union B) intersect S_k(A) intersect S_k(B)|/k.
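The estimator described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `h`, `bottom_k` and `estimate_jaccard` are hypothetical, and the stand-in hash (BLAKE2b over the element's `repr`) is not a 2-independent family, so the sketch shows the mechanics of the sample construction rather than the independence analysis.

```python
import hashlib

def h(x, seed=0):
    """Stand-in hash function (assumption): BLAKE2b digest read as an integer."""
    digest = hashlib.blake2b(repr((seed, x)).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def bottom_k(items, k, seed=0):
    """Return S_k(X): the k elements of X with the smallest hash values."""
    return set(sorted(set(items), key=lambda x: h(x, seed))[:k])

def estimate_jaccard(sample_a, sample_b, k, seed=0):
    """Estimate |A intersect B| / |A union B| from the two bottom-k samples."""
    # S_k(A union B) = S_k(S_k(A) union S_k(B))
    union_sample = bottom_k(sample_a | sample_b, k, seed)
    # Count the union-sample elements that appear in both samples.
    return len(union_sample & sample_a & sample_b) / k

if __name__ == "__main__":
    A = set(range(1000))
    B = set(range(500, 1500))   # true Jaccard similarity is 1/3
    k = 400
    print(estimate_jaccard(bottom_k(A, k), bottom_k(B, k), k))
```

Only the two size-k samples are needed at estimation time; the full sets A and B are touched once, when their samples are built.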
We show here that even if the hash function is only 2-independent, the
expected relative error is O(1/sqrt(fk)). For fk=Omega(1) this is within a
constant factor of the expected relative error with truly random hashing.
For comparison, consider the classic approach of k×min-wise where we use k
independent hash functions h_1,...,h_k, storing the smallest element with each
hash function. For k×min-wise there is an at least constant bias with constant
independence, and it is not reduced with larger k. Recently Feigenblat et al.
showed that bottom-k circumvents the bias if the hash function is 8-independent
and k is sufficiently large. We get down to 2-independence for any k. Our
result is based on a simple union bound, transferring generic concentration
bounds for the hashing scheme to the bottom-k sample, e.g., getting stronger
probability error bounds with higher independence.
For weighted sets, we consider priority sampling which adapts efficiently to
the concrete input weights, e.g., benefiting strongly from heavy-tailed input.
This time, the analysis is much more involved, but again we show that generic
concentration bounds can be applied. Comment: A short version appeared at STOC'1
Simple Tabulation, Fast Expanders, Double Tabulation, and High Independence
Simple tabulation dates back to Zobrist in 1970. Keys are viewed as c
characters from some alphabet A. We initialize c tables h_0, ..., h_{c-1}
mapping characters to random hash values. A key x=(x_0, ..., x_{c-1}) is hashed
to h_0[x_0] xor...xor h_{c-1}[x_{c-1}]. The scheme is extremely fast when the
character hash tables h_i are in cache. Simple tabulation hashing is not
4-independent, but we show that if we apply it twice, then we get high
independence. First we hash to intermediate keys that are 6 times longer than
the original keys, and then we hash the intermediate keys to the final hash
values.
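As a minimal sketch of the two steps just described, assuming 8-bit characters (so |A| = 256) and keys given as non-negative integers, simple tabulation and its two-level composition might look as follows. The names `make_tables`, `split_chars`, `simple_tabulation` and `make_double_tabulation` are hypothetical, and Python's `random` module stands in for a source of truly random table entries.

```python
import random

CHAR_BITS = 8                 # assumed character width, so |A| = 256
ALPHABET = 1 << CHAR_BITS

def make_tables(num_chars, out_bits, rng):
    """One table per character position, each mapping a character to a random out_bits-bit value."""
    return [[rng.getrandbits(out_bits) for _ in range(ALPHABET)]
            for _ in range(num_chars)]

def split_chars(x, num_chars):
    """View the integer key x as characters x_0, ..., x_{num_chars-1}."""
    return [(x >> (CHAR_BITS * i)) & (ALPHABET - 1) for i in range(num_chars)]

def simple_tabulation(tables, x):
    """h(x) = h_0[x_0] xor ... xor h_{c-1}[x_{c-1}]."""
    result = 0
    for table, ch in zip(tables, split_chars(x, len(tables))):
        result ^= table[ch]
    return result

def make_double_tabulation(c, out_bits, rng):
    """Compose two simple tabulation functions via intermediate keys of d = 6c characters."""
    d = 6 * c
    first = make_tables(c, d * CHAR_BITS, rng)   # c characters -> d-character intermediate key
    second = make_tables(d, out_bits, rng)       # d characters -> final hash value
    def hash_key(x):
        return simple_tabulation(second, simple_tabulation(first, x))
    return hash_key

if __name__ == "__main__":
    rng = random.Random(1)
    hash32 = make_double_tabulation(c=4, out_bits=32, rng=rng)  # 32-bit keys, 4 characters
    print(hex(hash32(0xDEADBEEF)))
```

The lack of 4-independence in a single level is easy to see in this sketch: for any four keys forming a "rectangle" (a, b), (a, b'), (a', b), (a', b'), the table entries cancel in pairs, so the xor of the four hash values is always 0.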
The intermediate keys have d=6c characters from A. We can view the hash
function as a degree d bipartite graph with keys on one side, each with edges
to d output characters. We show that this graph has nice expansion properties,
and from that we get that with another level of simple tabulation on the
intermediate keys, the composition is a highly independent hash function. The
independence we get is |A|^{Omega(1/c)}.
Our space is O(c|A|) and the hash function is evaluated in O(c) time. Siegel
[FOCS'89, SICOMP'04] proved that with this space, if the hash function is
evaluated in o(c) time, then the independence can only be o(c), so our
evaluation time is best possible for Omega(c) independence---our independence
is much higher if c=|A|^{o(1)}.
Siegel used O(c)^c evaluation time to get the same independence with similar
space. Siegel's main focus was c=O(1), but we are exponentially faster when
c=omega(1).
Applying our scheme recursively, we can increase our independence to
|A|^{Omega(1)} with o(c^{log c}) evaluation time. Compared with Siegel's scheme
this is both faster and higher independence.
Our scheme is easy to implement, and it does provide realistic
implementations of 100-independent hashing for, say, 32- and 64-bit keys.
Fast and Powerful Hashing using Tabulation
Randomized algorithms are often enjoyed for their simplicity, but the hash
functions employed to yield the desired probabilistic guarantees are often too
complicated to be practical. Here we survey recent results on how simple
hashing schemes based on tabulation provide unexpectedly strong guarantees.
Simple tabulation hashing dates back to Zobrist [1970]. Keys are viewed as
consisting of c characters and we have precomputed character tables h_1, ..., h_c
mapping characters to random hash values. A key x=(x_1, ..., x_c)
is hashed to h_1[x_1] xor ... xor h_c[x_c]. This scheme is
very fast with character tables in cache. While simple tabulation is not even
4-independent, it does provide many of the guarantees that are normally
obtained via higher independence, e.g., linear probing and Cuckoo hashing.
Next we consider twisted tabulation where one input character is "twisted" in
a simple way. The resulting hash function has powerful distributional
properties: Chernoff-Hoeffding type tail bounds and a very small bias for
min-wise hashing. This also yields an extremely fast pseudo-random number
generator that is provably good for many classic randomized algorithms and
data-structures.
Finally, we consider double tabulation where we compose two simple tabulation
functions, applying one to the output of the other, and show that this yields
very high independence in the classic framework of Carter and Wegman [1977]. In
fact, w.h.p., for a given set of size proportional to that of the space
consumed, double tabulation gives fully-random hashing. We also mention some
more elaborate tabulation schemes getting near-optimal independence for given
time and space.
While these tabulation schemes are all easy to implement and use, their
analysis is not
Combining agronomic and breeding approaches for improved nutrient use efficiency
There is a strong need to improve agricultural nutrient use efficiency (NUE), but NUE is complex, and not even well defined. The abstract and presentation deal with how NUE is determined by the combination of Genetic, Environmental and Management factors (GxExM), and how genetics as well as crop management must be combined in order to achieve improved overall NUE
An organic vegetable crop rotation aimed at self-sufficiency in nitrogen
The paper describes the organic vegetable crop rotation: the ideas behind its design, the use of green manures and catch crops, and how information on crop root growth has been used to try to design a crop rotation with a high NUE and minimal N leaching losses. The results from the first years of the rotation, in terms of yield and N uptake of the crops and of the content of inorganic N in the soil, are presented
Brassicas in sustainable production and organic farming
Brassica plant species show some characteristics in their use of plant nutrients which make them different from most other crops. These characteristics often make brassicas difficult to grow in low-input systems with limited nutrient availability, but at the same time they also make some brassica species valuable tools for reducing nitrate leaching losses and improving N management in farming systems. The paper presents experimental results on brassica crops as main crops and cover crops
Utilising differences in rooting depth to design vegetable crop rotations with high nitrogen use efficiency (NUE)
A number of methods involving plant or soil analysis or modelling have been developed to optimise N fertilization of vegetable crops. The methods aim at improving the NUE of each single crop, but do not really consider the crop rotation as such. Various measures can be used to increase the NUE of the crop rotation; measures that can be combined with the methods aimed at optimising NUE of each single crop.
The aims of the paper are to discuss the methods for optimising NUE at the crop rotation level and to present examples of how this can be done. The main methods discussed are 1) how can crops with different rooting depth be optimally placed in a cropping sequence and 2) how can catch crops be introduced to optimise NUE.
Results show that if N left in the soil after harvest of one crop is retained in the soil until spring, it will normally be found in deeper soil layers. Therefore rooting depth of the vegetable crops is important. It is illustrated that placing deep-rooted crops in the crop rotation preferentially where much N was left in the soil in the previous year can strongly increase the utilisation of the N residues.
It is also shown how catch crops can be used to maintain a high NUE, especially in situations where the farmers choose to grow shallow-rooted vegetables even though much N may be available in deeper soil layers
Effect of crop management practices on the sustainability and environmental impact of organic and low input food production systems
While organic farming can reduce many of the environmental problems caused by agriculture, it also includes some practices which are questionable in terms of environmental effects. Organic farming practices (rotations, fertilisation regimes, cover crop use) can differ significantly, and this leads to large differences in their environmental effects. This leaves considerable scope to improve the environmental effects of organic farming. The environmental aspects of organic farming are discussed, and model simulations are used to illustrate how even moderate changes in organic rotations can have large effects on sustainability, here measured by a simple index of nitrogen lost by leaching relative to nitrogen harvested by the crops. In WP3.3.4 we are working to improve model simulation of organic rotations, and in WP7.1 we are making environmental assessments of organic cropping practices tested in the QLIF project, using model simulations and other approaches