9,028 research outputs found
A 2D based Partition Strategy for Solving Ranking under Team Context (RTP)
In this paper, we propose a 2D based partition method for solving the problem
of Ranking under Team Context(RTC) on datasets without a priori. We first map
the data into 2D space using its minimum and maximum value among all
dimensions. Then we construct window queries with consideration of current team
context. Besides, during the query mapping procedure, we can pre-prune some
tuples which are not top ranked ones. This pre-classified step will defer
processing those tuples and can save cost while providing solutions for the
problem. Experiments show that our algorithm performs well especially on large
datasets with correctness
Experiments in terabyte searching, genomic retrieval and novelty detection for TREC 2004
In TREC2004, Dublin City University took part in three tracks, Terabyte (in collaboration with University College Dublin), Genomic and Novelty. In this paper we will discuss each track separately and present separate conclusions from this work. In addition, we present a general description of a text retrieval engine that we have developed in the last year to support our experiments into large scale, distributed information retrieval, which underlies all of the track experiments described in this document
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to triangular inequality, we also use Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.Comment: PVLDB 11(8):906-919, 201
De Novo Assembly of Nucleotide Sequences in a Compressed Feature Space
Sequencing technologies allow for an in-depth analysis
of biological species but the size of the generated datasets
introduce a number of analytical challenges. Recently, we
demonstrated the application of numerical sequence representations
and data transformations for the alignment of short
reads to a reference genome. Here, we expand out approach
for de novo assembly of short reads. Our results demonstrate
that highly compressed data can encapsulate the signal suffi-
ciently to accurately assemble reads to big contigs or complete
genomes
Unsupervised Graph-based Rank Aggregation for Improved Retrieval
This paper presents a robust and comprehensive graph-based rank aggregation
approach, used to combine results of isolated ranker models in retrieval tasks.
The method follows an unsupervised scheme, which is independent of how the
isolated ranks are formulated. Our approach is able to combine arbitrary
models, defined in terms of different ranking criteria, such as those based on
textual, image or hybrid content representations.
We reformulate the ad-hoc retrieval problem as a document retrieval based on
fusion graphs, which we propose as a new unified representation model capable
of merging multiple ranks and expressing inter-relationships of retrieval
results automatically. By doing so, we claim that the retrieval system can
benefit from learning the manifold structure of datasets, thus leading to more
effective results. Another contribution is that our graph-based aggregation
formulation, unlike existing approaches, allows for encapsulating contextual
information encoded from multiple ranks, which can be directly used for
ranking, without further computations and post-processing steps over the
graphs. Based on the graphs, a novel similarity retrieval score is formulated
using an efficient computation of minimum common subgraphs. Finally, another
benefit over existing approaches is the absence of hyperparameters.
A comprehensive experimental evaluation was conducted considering diverse
well-known public datasets, composed of textual, image, and multimodal
documents. Performed experiments demonstrate that our method reaches top
performance, yielding better effectiveness scores than state-of-the-art
baseline methods and promoting large gains over the rankers being fused, thus
demonstrating the successful capability of the proposal in representing queries
based on a unified graph-based model of rank fusions
- âŠ