TopSig: Topology Preserving Document Signatures
Performance comparisons between file signatures and inverted files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as language and
probabilistic models, and it has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but they have not so far been linked to general-purpose, signature-file-based
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We demonstrate significant
improvements in the performance of signature-file-based indexing and retrieval,
achieving performance comparable to that of state-of-the-art inverted-file-based
systems, including language models and BM25. These findings suggest that file
signatures offer a viable alternative to inverted files in suitable settings;
from a theoretical perspective, they position the file signature model within
the class of vector space retrieval models.
Comment: 12 pages, 8 figures, CIKM 201
Label Space Partition Selection for Multi-Object Tracking Using Two-Layer Partitioning
Estimating the trajectories of multiple objects poses a significant challenge
due to data association ambiguity, which leads to a substantial increase in
computational requirements. To address such problems, a divide-and-conquer
strategy has been employed with parallel computation. In this strategy,
distinct objects with unique labels are grouped based on their statistical
dependencies, i.e., the intersection of their predicted measurements. Several
geometric approaches have been used for label grouping, since finding all
intersecting label pairs is clearly infeasible for large-scale tracking
problems. This paper proposes an efficient implementation of label grouping for
the label-partitioned generalized labeled multi-Bernoulli filter framework
using a secondary partitioning technique. This allows for parallel computation
in the label graph indexing step, avoiding the generation and subsequent
elimination of duplicate comparisons. Additionally, we compare the performance
of the proposed technique with several efficient spatial searching algorithms.
The results demonstrate the superior performance of the proposed approach on
large-scale data sets, enabling scalable trajectory estimation.
Comment: 6 pages, 4 figures
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and
derivative data based on the Stack Overflow collection are available online.
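The exact brute-force k-NN baseline that the paper's approximate search speeds up can be sketched in a few lines: score every record against the query with a similarity function and keep the top k. The cosine-over-bag-of-words similarity below is a simple illustrative stand-in for the subtler similarity functions the paper discusses.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between sparse bag-of-words vectors."""
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

def knn(query: str, docs: list, k: int = 3):
    """Exact brute-force k-NN: score all documents, keep the top k.

    This is the slow baseline; approximate indices (e.g. proximity
    graphs) avoid scoring the whole collection at a small accuracy cost.
    """
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(d.lower().split())), i)
              for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]
```

Because the candidate set comes from the similarity function itself rather than from exact term matches, records with no query term in common can still be retrieved, which is how the vocabulary-mismatch problem is sidestepped.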
Boosting Image Forgery Detection using Resampling Features and Copy-move analysis
Realistic image forgeries involve a combination of splicing, resampling,
cloning, region removal and other methods. While resampling detection
algorithms are effective in detecting splicing and resampling, copy-move
detection algorithms excel in detecting cloning and region removal. In this
paper, we combine these complementary approaches in a way that boosts the
overall accuracy of image manipulation detection. We use the copy-move
detection method as a pre-filtering step and pass those images that are
classified as untampered to a deep learning based resampling detection
framework. Experimental results on various datasets, including the 2017 NIST
Nimble Challenge Evaluation dataset comprising nearly 10,000 pristine and
tampered images, show a consistent increase of 8%-10% in detection rates when
the copy-move algorithm is combined with different resampling detection
algorithms.
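The cascade described above (copy-move detection as a pre-filter, resampling detection on whatever passes) reduces to a short control flow. The detector callables below are hypothetical placeholders; their internals (feature extraction, the deep network) are outside this sketch.

```python
def cascade_detector(image, copy_move_detect, resampling_detect):
    """Sketch of the combined pipeline: the copy-move detector acts as
    a pre-filter, and only images it classifies as untampered are
    passed to the resampling detector.

    Both detectors are assumed to be callables returning
    (is_tampered: bool, score: float); these names are illustrative.
    """
    tampered, score = copy_move_detect(image)
    if tampered:
        return {"tampered": True, "stage": "copy-move", "score": score}
    tampered, score = resampling_detect(image)
    return {"tampered": tampered, "stage": "resampling", "score": score}
```

The design choice is that the two detectors have complementary strengths, so a positive from either stage flags the image, while the ordering lets the cheaper or higher-precision stage short-circuit the pipeline.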
Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
Recent years have witnessed the rapid growth of the World Wide Web (WWW). Information is accessible at one's fingertips, anytime and anywhere, through the massive web repository. The performance and reliability of web search engines thus face serious challenges due to the enormous volume of web data, and this volume of web documents has led to search results of lower relevance to the user. In addition, the presence of duplicate and near-duplicate web documents creates an extra overhead for search engines, critically affecting their performance. The demand for integrating data from heterogeneous sources also gives rise to near-duplicate web pages. The detection of near-duplicate documents within a collection has recently become an area of great interest. In this research, we present an efficient approach for the detection of near-duplicate web pages in web crawling that uses keywords and a distance measure. G. S. Manku et al.'s fingerprint-based approach, proposed in 2007, is considered one of the state-of-the-art algorithms for finding near-duplicate web pages. We implemented both approaches and conducted an extensive comparative study between our similarity-score-based approach and Manku et al.'s fingerprint-based approach. We analyzed the results in terms of time complexity, space complexity, memory usage, and confusion matrix parameters. Taking these performance factors into account, the comparison shows our approach to be the better (less complex) of the two.
DOI: http://dx.doi.org/10.11591/ijece.v2i6.1746
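The fingerprint-based baseline compared against above is built on SimHash-style fingerprints: each document is reduced to a short bit string, and two documents are declared near-duplicates when their fingerprints differ in at most a few bit positions. A minimal sketch, with illustrative parameter choices (64 bits, threshold 3) rather than the exact configuration of either system:

```python
import hashlib

def simhash(tokens, bits=64):
    """SimHash fingerprint: sum signed bit contributions of each token's
    hash, then keep the sign of each position (illustrative sketch)."""
    acc = [0] * bits
    for tok in tokens:
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "big")
        for i in range(bits):
            acc[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if acc[i] > 0:
            fp |= 1 << i
    return fp

def near_duplicates(fp1, fp2, max_hamming=3):
    """Declare near-duplicates when the fingerprints differ in at most
    max_hamming bit positions (3 is a common choice for 64-bit prints)."""
    return bin(fp1 ^ fp2).count("1") <= max_hamming
```

Because similar token multisets push the same accumulator positions in the same direction, near-duplicate pages end up with fingerprints a small Hamming distance apart, which is what makes the threshold test meaningful.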