Efficient Document Indexing Using Pivot Tree
We present a novel method for efficiently finding the top-k neighbors of
documents represented in a high-dimensional space of terms, based on cosine
similarity. Documents are mostly stored in a bag-of-words tf-idf representation.
One of the most used ways of computing similarity between a pair of documents
is cosine similarity between the vector representations, but cosine similarity
is not a metric distance measure, as it does not satisfy the triangle inequality;
therefore, most metric search methods cannot be applied directly. We propose
an efficient method for indexing documents using a pivot tree that leads to
efficient retrieval. We also study the relation between precision and
efficiency for the proposed method and compare it with the state of the art in
the area of document searching based on inner product.
Comment: 6 pages, 2 figures
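The claim that cosine-based dissimilarity violates the triangle inequality can be checked with a small sketch. It assumes the common conversion of similarity to a dissimilarity, 1 - cos(u, v); the abstract does not say which conversion the authors use, and the three vectors are made up for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed dissimilarity: 1 - cosine similarity (not taken from the paper)."""
    return 1.0 - cosine_similarity(u, v)

# Three illustrative 2-D "documents": b lies halfway between a and c.
a = (1.0, 0.0)
b = (1.0, 1.0)
c = (0.0, 1.0)

d_ab = cosine_distance(a, b)
d_bc = cosine_distance(b, c)
d_ac = cosine_distance(a, c)

# A metric would require d_ac <= d_ab + d_bc.
violates_triangle_inequality = d_ac > d_ab + d_bc
print(d_ab, d_bc, d_ac, violates_triangle_inequality)
```

Going through b is "shorter" than the direct distance from a to c, which is why pruning rules that rely on the triangle inequality cannot be used unchanged.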
Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency
Subsequence matching has appeared to be an ideal approach for solving many
problems related to the fields of data mining and similarity retrieval. It has
been shown that almost any data class (audio, image, biometrics, signals) is or
can be represented by some kind of time series or string of symbols, which can
be seen as an input for various subsequence matching approaches. The variety of
data types, specific tasks and their partial or full solutions is so wide that
the choice, implementation and parametrization of a suitable solution for a
given task might be complicated and time-consuming; a possibly fruitful
combination of fragments from different research areas may be neither obvious nor
easy to realize. Leading authors in this field also mention the
implementation bias that makes a proper comparison of competing approaches
difficult. Therefore, we present a new generic Subsequence Matching Framework
(SMF) that tries to overcome the aforementioned problems by a uniform frame
that simplifies and speeds up the design, development and evaluation of
subsequence matching related systems. We identify several relatively separate
subtasks that are solved differently across the literature, and SMF makes it
possible to combine them in a straightforward manner, achieving new quality and
efficiency. This framework can be used in many application domains and its
components can be reused effectively. Its strictly modular architecture and
openness also enable the integration of efficient solutions from different
fields, for instance efficient metric-based indexes. This is an extended
version of a paper published at DEXA 2012.
Record-Linkage from a Technical Point of View
Record linkage is used for preparing sampling frames, deduplicating lists, and combining information on the same object from two different databases. If the same objects in two different databases share error-free, unique common identifiers such as personal identification numbers (PID), record linkage is a simple file merge operation. If the identifiers contain errors, record linkage is a challenging task. In many applications, the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues and future research topics are discussed.
Keywords: Record-Linkage, Data-mining, Privacy preserving protocols
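The "simple file merge" case can be illustrated with a minimal sketch, assuming both files are lists of records with an error-free `pid` field. The function name, field names, and sample records are hypothetical and not taken from the paper.

```python
def link_by_exact_id(small_file, large_file, key="pid"):
    """Join two lists of record dicts on an error-free unique identifier.

    Hypothetical helper: builds a lookup table over one file and probes it
    with the records of the other, which is an exact-match merge.
    """
    index = {record[key]: record for record in large_file}
    linked = []
    for record in small_file:
        match = index.get(record[key])
        if match is not None:
            linked.append((record, match))
    return linked

# Made-up sample data: a small survey file and a larger register.
survey = [
    {"pid": "A1", "age": 34},
    {"pid": "B2", "age": 51},
    {"pid": "C3", "age": 27},  # no counterpart in the register
]
register = [
    {"pid": "B2", "benefit": 120},
    {"pid": "A1", "benefit": 80},
    {"pid": "Z9", "benefit": 10},
]

pairs = link_by_exact_id(survey, register)
for survey_record, register_record in pairs:
    print(survey_record["pid"], survey_record["age"], register_record["benefit"])
```

An exact lookup like this silently misses any pair whose identifier contains a typo, which is why the error-prone case needs approximate matching.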
Using Apache Lucene to Search Vector of Locally Aggregated Descriptors
Surrogate Text Representation (STR) is a profitable solution for efficient
similarity search in metric spaces using conventional text search engines, such
as Apache Lucene. This technique is based on comparing the permutations of some
reference objects in place of the original metric distance. However, the
Achilles heel of the STR approach is the need to reorder the result set of the
search according to the metric distance. This forces the use of a support database
to store the original objects, which requires efficient random I/O on a fast
secondary memory (such as flash-based storage). In this paper, we propose to
extend the Surrogate Text Representation to specifically address a class of
visual metric objects known as Vector of Locally Aggregated Descriptors (VLAD).
This approach is based on representing the individual sub-vectors forming the
VLAD vector with the STR, providing a finer representation of the vector and
enabling us to get rid of the reordering phase. The experiments on a publicly
available dataset show that the extended STR outperforms the baseline STR,
achieving satisfactory performance close to that obtained with the original
VLAD vectors.
Comment: In Proceedings of the 11th Joint Conference on Computer Vision,
Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) -
Volume 4: VISAPP, p. 383-39
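The permutation idea behind STR can be sketched as follows. This is only an illustration under stated assumptions: the Euclidean distance, the pivot coordinates, the Spearman footrule comparison, and the rule that nearer reference objects are repeated more often in the surrogate text are all choices made here, not details given in the abstract. It does not attempt the VLAD sub-vector extension.

```python
import math

def pivot_permutation(obj, pivots):
    """Indices of the reference objects, ordered from nearest to farthest."""
    return sorted(range(len(pivots)),
                  key=lambda i: (math.dist(obj, pivots[i]), i))

def surrogate_text(permutation, k):
    """Encode the k nearest pivots as words (assumed scheme).

    A pivot at rank r (0-based) is repeated k - r times, so a text engine's
    term-frequency weighting favours objects that share their nearest pivots.
    """
    words = []
    for rank, pivot_id in enumerate(permutation[:k]):
        words.extend([f"p{pivot_id}"] * (k - rank))
    return " ".join(words)

def spearman_footrule(perm_a, perm_b):
    """Sum of rank differences between two permutations of the same pivots."""
    rank_in_b = {pivot_id: rank for rank, pivot_id in enumerate(perm_b)}
    return sum(abs(rank - rank_in_b[pivot_id])
               for rank, pivot_id in enumerate(perm_a))

# Hypothetical reference objects and data points.
pivots = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
x, y, z = (1.0, 2.0), (2.0, 1.0), (9.0, 8.0)

perm_x = pivot_permutation(x, pivots)
perm_y = pivot_permutation(y, pivots)
perm_z = pivot_permutation(z, pivots)

print(surrogate_text(perm_x, k=2))
print(spearman_footrule(perm_x, perm_y), spearman_footrule(perm_x, perm_z))
```

Objects that are close in the original space (x and y) see the pivots in a similar order, so their permutations differ less than those of distant objects (x and z).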
Ptolemaic Indexing
This paper discusses a new family of bounds for use in similarity search,
related to those used in metric indexing, but based on Ptolemy's inequality,
rather than the metric axioms. Ptolemy's inequality holds for the well-known
Euclidean distance, but is also shown here to hold for quadratic form metrics
in general, with Mahalanobis distance as an important special case. The
inequality is examined empirically on both synthetic and real-world data sets
and is also found to hold approximately, with a very low degree of error, for
important distances such as the angular pseudometric and several Lp norms.
Indexing experiments demonstrate a highly increased filtering power compared to
existing, triangular methods. It is also shown that combining the Ptolemaic and
triangular filtering can lead to better results than using either approach on
its own.
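To make the comparison of triangular and Ptolemaic filtering concrete, here is a minimal sketch of both lower bounds on the query-object distance d(q, o). The Ptolemaic bound follows from Ptolemy's inequality applied to the four points q, o, p, s, giving d(q, o) >= |d(q, p) d(o, s) - d(q, s) d(o, p)| / d(p, s). The function names, the choice of two pivots p and s, and the sample points are assumptions for illustration.

```python
import math

def triangular_lower_bound(d_qp, d_op):
    """Lower bound on d(q, o) from the triangle inequality with one pivot p."""
    return abs(d_qp - d_op)

def ptolemaic_lower_bound(d_qp, d_qs, d_op, d_os, d_ps):
    """Lower bound on d(q, o) from Ptolemy's inequality with pivots p and s."""
    return abs(d_qp * d_os - d_qs * d_op) / d_ps

# Hypothetical query, object, and pivots in the Euclidean plane.
q, o = (0.0, 0.0), (4.0, 0.0)
p, s = (0.0, 1.0), (4.0, 2.0)

d_qp, d_qs = math.dist(q, p), math.dist(q, s)
d_op, d_os = math.dist(o, p), math.dist(o, s)
d_ps = math.dist(p, s)

true_distance = math.dist(q, o)
tri_bound = max(triangular_lower_bound(d_qp, d_op),
                triangular_lower_bound(d_qs, d_os))
pto_bound = ptolemaic_lower_bound(d_qp, d_qs, d_op, d_os, d_ps)

# Both bounds are valid, so the larger one can be used for filtering.
combined_bound = max(tri_bound, pto_bound)
print(tri_bound, pto_bound, combined_bound, true_distance)
```

An object can be discarded without computing d(q, o) whenever the combined bound already exceeds the search radius. Which of the two bounds is larger depends on the configuration of the points.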
Techniques for effective and efficient fire detection from social media images
Social media could provide valuable information to support decision making in
crisis management, such as in accidents, explosions and fires. However, much of
the data from social media are images, which are uploaded at a rate that makes
it impossible for human beings to analyze them. Despite the many works on image
analysis, there are no fire detection studies on social media. To fill this
gap, we propose the use and evaluation of a broad set of content-based image
retrieval and classification techniques for fire detection. Our main
contributions are: (i) the development of the Fast-Fire Detection method
(FFDnR), which combines feature extractors and evaluation functions to support
instance-based learning, (ii) the construction of an annotated set of images
with ground-truth depicting fire occurrences -- the FlickrFire dataset, and
(iii) the evaluation of 36 efficient image descriptors for fire detection.
Using real data from Flickr, our results showed that FFDnR was able to achieve
a precision for fire detection comparable to that of human annotators.
Therefore, our work shall provide a solid basis for further developments on
monitoring images from social media.
Comment: 12 pages, Proceedings of the International Conference on Enterprise
Information Systems. Specifically: Marcos Bedo, Gustavo Blanco, Willian
Oliveira, Mirela Cazzolato, Alceu Costa, Jose Rodrigues, Agma Traina, Caetano
Traina, 2015, Techniques for effective and efficient fire detection from
social media images, ICEIS, 34-4