152 research outputs found

    Efficient Document Indexing Using Pivot Tree

    Full text link
    We present a novel method for efficiently searching top-k neighbors for documents represented in high dimensional space of terms based on the cosine similarity. Mostly, documents are stored as bag-of-words tf-idf representation. One of the most used ways of computing similarity between a pair of documents is cosine similarity between the vector representations, but cosine similarity is not a metric distance measure as it doesn't follow triangle inequality, therefore most metric searching methods can not be applied directly. We propose an efficient method for indexing documents using a pivot tree that leads to efficient retrieval. We also study the relation between precision and efficiency for the proposed method and compare it with a state of the art in the area of document searching based on inner product.Comment: 6 Pages, 2 Figure

    Generic Subsequence Matching Framework: Modularity, Flexibility, Efficiency

    Get PDF
    Subsequence matching has appeared to be an ideal approach for solving many problems related to the fields of data mining and similarity retrieval. It has been shown that almost any data class (audio, image, biometrics, signals) is or can be represented by some kind of time series or string of symbols, which can be seen as an input for various subsequence matching approaches. The variety of data types, specific tasks and their partial or full solutions is so wide that the choice, implementation and parametrization of a suitable solution for a given task might be complicated and time-consuming; a possibly fruitful combination of fragments from different research areas may not be obvious nor easy to realize. The leading authors of this field also mention the implementation bias that makes difficult a proper comparison of competing approaches. Therefore we present a new generic Subsequence Matching Framework (SMF) that tries to overcome the aforementioned problems by a uniform frame that simplifies and speeds up the design, development and evaluation of subsequence matching related systems. We identify several relatively separate subtasks solved differently over the literature and SMF enables to combine them in straightforward manner achieving new quality and efficiency. This framework can be used in many application domains and its components can be reused effectively. Its strictly modular architecture and openness enables also involvement of efficient solutions from different fields, for instance efficient metric-based indexes. This is an extended version of a paper published on DEXA 2012.Comment: This is an extended version of a paper published on DEXA 201

    Record-Linkage from a Technical Point of View

    Get PDF
    TRecord linkage is used for preparing sampling frames, deduplication of lists and combining information on the same object from two different databases. If the identifiers of the same objects in two different databases have error free unique common identifiers like personal identification numbers (PID), record linkage is a simple file merge operation. If the identifiers contains errors, record linkage is a challenging task. In many applications, the files have widely different numbers of observations, for example a few thousand records of a sample survey and a few million records of an administrative database of social security numbers. Available software, privacy issues and future research topics are discussed.Record-Linkage, Data-mining, Privacy preserving protocols

    Using Apache Lucene to Search Vector of Locally Aggregated Descriptors

    Full text link
    Surrogate Text Representation (STR) is a profitable solution to efficient similarity search on metric space using conventional text search engines, such as Apache Lucene. This technique is based on comparing the permutations of some reference objects in place of the original metric distance. However, the Achilles heel of STR approach is the need to reorder the result set of the search according to the metric distance. This forces to use a support database to store the original objects, which requires efficient random I/O on a fast secondary memory (such as flash-based storages). In this paper, we propose to extend the Surrogate Text Representation to specifically address a class of visual metric objects known as Vector of Locally Aggregated Descriptors (VLAD). This approach is based on representing the individual sub-vectors forming the VLAD vector with the STR, providing a finer representation of the vector and enabling us to get rid of the reordering phase. The experiments on a publicly available dataset show that the extended STR outperforms the baseline STR achieving satisfactory performance near to the one obtained with the original VLAD vectors.Comment: In Proceedings of the 11th Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2016) - Volume 4: VISAPP, p. 383-39

    Ptolemaic Indexing

    Full text link
    This paper discusses a new family of bounds for use in similarity search, related to those used in metric indexing, but based on Ptolemy's inequality, rather than the metric axioms. Ptolemy's inequality holds for the well-known Euclidean distance, but is also shown here to hold for quadratic form metrics in general, with Mahalanobis distance as an important special case. The inequality is examined empirically on both synthetic and real-world data sets and is also found to hold approximately, with a very low degree of error, for important distances such as the angular pseudometric and several Lp norms. Indexing experiments demonstrate a highly increased filtering power compared to existing, triangular methods. It is also shown that combining the Ptolemaic and triangular filtering can lead to better results than using either approach on its own

    Techniques for effective and efficient fire detection from social media images

    Get PDF
    Social media could provide valuable information to support decision making in crisis management, such as in accidents, explosions and fires. However, much of the data from social media are images, which are uploaded in a rate that makes it impossible for human beings to analyze them. Despite the many works on image analysis, there are no fire detection studies on social media. To fill this gap, we propose the use and evaluation of a broad set of content-based image retrieval and classification techniques for fire detection. Our main contributions are: (i) the development of the Fast-Fire Detection method (FFDnR), which combines feature extractor and evaluation functions to support instance-based learning, (ii) the construction of an annotated set of images with ground-truth depicting fire occurrences -- the FlickrFire dataset, and (iii) the evaluation of 36 efficient image descriptors for fire detection. Using real data from Flickr, our results showed that FFDnR was able to achieve a precision for fire detection comparable to that of human annotators. Therefore, our work shall provide a solid basis for further developments on monitoring images from social media.Comment: 12 pages, Proceedings of the International Conference on Enterprise Information Systems. Specifically: Marcos Bedo, Gustavo Blanco, Willian Oliveira, Mirela Cazzolato, Alceu Costa, Jose Rodrigues, Agma Traina, Caetano Traina, 2015, Techniques for effective and efficient fire detection from social media images, ICEIS, 34-4
    corecore