Burrows–Wheeler post-transformation with effective clustering and interpolative coding
Lossless compression methods based on the Burrows–Wheeler transform
(BWT) are regarded as an excellent compromise between speed and
compression efficiency: they provide compression rates close to those of
PPM algorithms, at the speed of dictionary-based methods. Instead of the
laborious statistics-gathering process used in PPM, the BWT reversibly
sorts the input symbols, using as the sort key as many following
characters as necessary to make the sort unique. Characters occurring in
similar contexts are sorted close together, resulting in a clustered
symbol sequence. Run-length encoding and Move-to-Front (MTF) recoding,
combined with a statistical Huffman or arithmetic coder, are then
typically used to exploit the clustering. A drawback of MTF recoding
is that knowledge of the character that produced each MTF number is
lost. In this paper, we present a new, competitive Burrows–Wheeler
post-transform stage that takes advantage of interpolative coding, a fast
binary encoding method for integer sequences that can exploit
clusters without requiring explicit statistics. We introduce a fast and
simple way to retain knowledge of the run characters during MTF
recoding, and use it to improve the clustering of MTF numbers and
run lengths by applying a reversible, stable sort with the run
characters as sort keys, achieving a significant improvement in the
compression rate, as shown here by experiments on common text corpora.
Automatic detection of cereal rows by means of pattern recognition techniques
Automatic locating of weeds in fields is an active research topic in precision agriculture. A reliable and practical plant identification technique would enable a reduction in herbicide amounts and lower production costs, along with reducing damage to the ecosystem. When the seeds have been sown row-wise, most weeds may be located between the sowing rows. The present work describes a clustering-based method for recognition of plantlet rows from a set of aerial photographs, taken by a drone flying at approximately ten meters. The algorithm includes three phases: segmentation of green objects in the view, feature extraction, and clustering of plants into individual rows. Segmentation separates the plants from the background. The main feature to be extracted is the center of gravity of each plant segment. A tentative clustering is obtained piecewise by applying the 2D Fourier transform to image blocks to get information about the direction of and the distance between the rows. The precise sowing line position is finally derived by principal component analysis. The method was able to find the rows from a set of photographs of size 1452 x 969 pixels in approximately 0.11 s, with an accuracy of 94 per cent.
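The final PCA step above can be sketched as follows. This is a generic illustration, not the paper's implementation: given the plant centers of gravity for one tentative row, the dominant eigenvector of their 2x2 covariance matrix gives the sowing-line direction (here computed in closed form for the 2D case):

```python
import math

def principal_row_direction(points):
    # points: list of (x, y) plant centers of gravity for one row.
    # Returns the mean point and the unit vector along the dominant
    # principal axis of the 2x2 covariance matrix, i.e. the fitted
    # sowing-line direction.
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form principal-axis angle for a 2x2 covariance matrix:
    # theta = 0.5 * atan2(2*sxy, sxx - syy)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (mx, my), (math.cos(theta), math.sin(theta))
```

The mean point and direction vector together define the sowing line in point-direction form.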
Clustering of Shared Subobjects in Databases
The topic of this article is multi-criterion, structure-based clustering in object-oriented databases. We study an object class, which is the target (subobject) of several multi-valued reference types from other object classes. The aim is to serve all access paths fairly, so that the number of page accesses is proportional to the number of referenced occurrences of the subobject class. An efficient heuristic algorithm is developed for inserting new subobjects in an existing page set. Significant benefits can be obtained in read-intensive applications, if the confluent references are semantically correlated.
Keywords: Clustering, Page allocation, Object-oriented databases
1 Introduction
Clustering of objects (records) is necessary for effective database processing, as long as they are stored on conventional disks with slow random access. Three general kinds of clustering can be distinguished: 1. Content-based clustering: Objects sharing the same value for a certain attribute are placed..
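To make the page-allocation setting concrete, here is a toy greedy placement sketch. This is a hypothetical illustration of the general idea (clustering a new subobject with semantically correlated, co-referenced subobjects), not the paper's heuristic algorithm:

```python
def place_subobject(new_id, co_referenced, pages, capacity):
    # Greedy sketch (hypothetical): pick the non-full page that already
    # holds the most subobjects co-referenced with the new one, so that
    # correlated confluent references end up on the same page and a
    # parent's referenced occurrences cost few page accesses.
    best_page, best_score = None, -1
    for page in pages:
        if len(page) >= capacity:
            continue
        score = len(page & co_referenced)
        if score > best_score:
            best_page, best_score = page, score
    if best_page is None:
        best_page = set()   # all pages full: allocate a new page
        pages.append(best_page)
    best_page.add(new_id)
    return best_page
```

A real allocator would also weigh page fill factors and multiple competing access paths, which is where the multi-criterion aspect of the article comes in.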
Probabilistic Iterative Expansion of Candidates in Mining Frequent Itemsets
A simple new algorithm is suggested for frequent itemset mining, using item probabilities as the basis for generating candidates. The method first finds all the frequent items, and then generates an estimate of the frequent sets, assuming item independence. The candidates are stored in a trie where each path from the root to a node represents one candidate itemset. The method expands the trie iteratively, until all frequent itemsets are found. Expansion is based on scanning through the data set in each iteration cycle, and extending the subtries based on observed node frequencies. Trie probing can be restricted to only those nodes which possibly need extension. The number of candidates is usually quite moderate: for dense datasets 2-4 times the number of final frequent itemsets, for non-dense sets somewhat more. In practical experiments the method has been observed to make clearly fewer passes than the well-known Apriori method. As for speed, our non-optimised implementation is in some cases faster and in others slower than the comparison methods.
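The initial independence-based candidate estimate can be sketched as follows. This is a simplified flat-list illustration of the probabilistic estimate only (the paper stores candidates in a trie and refines them iteratively against the data): an itemset is predicted frequent when the row count times the product of its items' observed probabilities reaches the support threshold.

```python
from itertools import combinations

def estimate_candidates(item_counts, n_rows, min_support):
    # Predict frequent itemsets under item independence:
    # {a, b, ...} is a candidate if n * p(a) * p(b) * ... >= min_support.
    freq_items = [i for i, c in item_counts.items() if c >= min_support]
    probs = {i: item_counts[i] / n_rows for i in freq_items}
    candidates = []
    for k in range(1, len(freq_items) + 1):
        found = False
        for combo in combinations(sorted(freq_items), k):
            est = n_rows
            for item in combo:
                est *= probs[item]
            if est >= min_support:
                candidates.append(frozenset(combo))
                found = True
        if not found:
            break  # estimates only shrink with k (probabilities <= 1)
    return candidates
```

The real algorithm then scans the data to replace these estimates with observed node frequencies and expands only the subtries that may still need extension.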