Bayes Merging of Multiple Vocabularies for Scalable Image Retrieval
The Bag-of-Words (BoW) representation is widely used in recent
state-of-the-art image retrieval work. Typically, multiple vocabularies are
generated to correct quantization artifacts and improve recall. However, this
routine is corrupted by vocabulary correlation, i.e., overlapping among
different vocabularies. Vocabulary correlation leads to an over-counting of the
indexed features in the overlapped area, or the intersection set, thus
compromising the retrieval accuracy. In order to address the correlation
problem while preserving the benefit of high recall, this paper proposes a Bayes
merging approach to down-weight the indexed features in the intersection set.
Through explicitly modeling the correlation problem in a probabilistic view, a
joint similarity on both image- and feature-level is estimated for the indexed
features in the intersection set.
We evaluate our method through extensive experiments on three benchmark
datasets. Albeit simple, Bayes merging can be well applied in various merging
tasks, and consistently improves the baselines on multi-vocabulary merging.
Moreover, Bayes merging is efficient in terms of both time and memory cost, and
yields competitive performance compared with the state-of-the-art methods.
Comment: 8 pages, 7 figures, 6 tables, accepted to CVPR 201
Distributed Formal Concept Analysis Algorithms Based on an Iterative MapReduce Framework
While many existing formal concept analysis algorithms are efficient, they
are typically unsuitable for distributed implementation. Taking the MapReduce
(MR) framework as our inspiration, we introduce a distributed approach for
performing formal concept mining. Our method has its novelty in that we use a
light-weight MapReduce runtime called Twister which is better suited to
iterative algorithms than recent distributed approaches. First, we describe the
theoretical foundations underpinning our distributed formal concept analysis
approach. Second, we provide a representative exemplar of how a classic
centralized algorithm can be implemented in a distributed fashion using our
methodology: we modify Ganter's classic algorithm by introducing a family of
MR* algorithms, namely MRGanter and MRGanter+ where the prefix denotes the
algorithm's lineage. To evaluate the factors that impact distributed algorithm
performance, we compare our MR* algorithms with the state-of-the-art.
Experiments conducted on real datasets demonstrate that MRGanter+ is efficient,
scalable and an appealing algorithm for distributed problems.
Comment: 17 pages, ICFCA 201, Formal Concept Analysis 201
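To make the formal concept analysis entry above more concrete, here is a minimal single-machine sketch of the classic next-closure enumeration usually attributed to Ganter, which the abstract names as the starting point for its MR* family. It is not the distributed MRGanter or MRGanter+ algorithm, and the function names and the tiny example context are assumptions made purely for illustration.

```python
def make_closure(context, n_attrs):
    """Build the closure operator B -> B'' for a formal context.

    context: list of frozensets, one attribute set per object (illustrative layout).
    """
    all_attrs = frozenset(range(n_attrs))

    def closure(attrs):
        # Extent: every object that has all attributes in attrs.
        extent = [row for row in context if attrs <= row]
        # Intent: attributes shared by the whole extent
        # (all attributes when the extent is empty).
        intent = set(all_attrs)
        for row in extent:
            intent &= row
        return frozenset(intent)

    return closure


def next_closure(current, n_attrs, closure):
    """Return the next closed attribute set in lectic order, or None at the end."""
    a = set(current)
    for i in reversed(range(n_attrs)):
        if i in a:
            a.discard(i)
        else:
            candidate = closure(frozenset(a | {i}))
            # Accept only if closing added no attribute smaller than i.
            if all(j >= i for j in candidate - a):
                return candidate
    return None


def all_intents(context, n_attrs):
    """Enumerate every concept intent, each exactly once."""
    closure = make_closure(context, n_attrs)
    current = closure(frozenset())
    intents = [current]
    while True:
        current = next_closure(current, n_attrs, closure)
        if current is None:
            return intents
        intents.append(current)


# Hypothetical context: three objects over attributes 0, 1, 2.
example_context = [frozenset({0, 1}), frozenset({1, 2}), frozenset({0})]
example_intents = all_intents(example_context, 3)
```

Each step computes one closure from the previous intent, so the enumeration is inherently sequential; that is presumably part of what a distributed variant has to restructure.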
Information Integration - the process of integration, evolution and versioning
At present, many information sources are available wherever you are. Most of the time, the information needed is spread across several of those information sources. Gathering this information is a tedious and time-consuming job. Automating this process would assist users in their task. Integration of the information sources provides a global information source with all information needed present. All of these information sources also change over time. With each change of the information source, the schema of this source can be changed as well. The data contained in the information source, however, cannot be changed every time, due to the huge amount of data that would have to be converted in order to conform to the most recent schema.
In this report we describe the current methods for information integration, evolution and versioning. We distinguish between integration of schemas and integration of the actual data. We also show some key issues when integrating XML data sources.
Multimedia search without visual analysis: the value of linguistic and contextual information
This paper addresses the focus of this special issue by analyzing the potential contribution of linguistic content and other non-image aspects to the processing of audiovisual data. It summarizes the various ways in which linguistic content analysis contributes to enhancing the semantic annotation of multimedia content and, as a consequence, to improving the effectiveness of conceptual media access tools. A number of techniques are presented, including the time-alignment of textual resources, audio and speech processing, content reduction and reasoning tools, and the exploitation of surface features.
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.
Comment: 7 page
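As a rough illustration of the git-like operations the DataHub abstract lists (create, branch, difference), the following toy in-memory store keeps each dataset version as a full snapshot of rows. The class name, method names and storage layout are hypothetical and are not DataHub's API. Storing full snapshots is the naive choice that would not hold up at scale.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetRepo:
    """Toy dataset version store (hypothetical sketch, not DataHub's design)."""
    versions: dict = field(default_factory=dict)  # version id -> (parent id, rows)
    branches: dict = field(default_factory=dict)  # branch name -> head version id

    def commit(self, branch, rows):
        # Record a full snapshot whose parent is the branch's current head.
        parent = self.branches.get(branch)
        version_id = len(self.versions) + 1
        self.versions[version_id] = (parent, frozenset(rows))
        self.branches[branch] = version_id
        return version_id

    def branch(self, new_branch, from_branch):
        # A new branch starts at the same head; no data is copied.
        self.branches[new_branch] = self.branches[from_branch]

    def diff(self, old, new):
        # Rows added and rows removed when going from version old to version new.
        old_rows = self.versions[old][1]
        new_rows = self.versions[new][1]
        return new_rows - old_rows, old_rows - new_rows


repo = DatasetRepo()
v1 = repo.commit("main", [("alice", 30), ("bob", 25)])
repo.branch("cleanup", "main")
v2 = repo.commit("cleanup", [("alice", 30), ("bob", 26)])
added, removed = repo.diff(v1, v2)
```

Here `added` holds the corrected row and `removed` the row it replaced, while the `main` branch still points at the first version.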
Multiple Retrieval Models and Regression Models for Prior Art Search
This paper presents the system called PATATRAS (PATent and Article Tracking,
Retrieval and AnalysiS) realized for the IP track of CLEF 2009. Our approach
presents three main characteristics: 1. The usage of multiple retrieval models
(KL, Okapi) and term index definitions (lemma, phrase, concept) for the three
languages considered in the present track (English, French, German) producing
ten different sets of ranked results. 2. The merging of the different results
based on multiple regression models using an additional validation set created
from the patent collection. 3. The exploitation of patent metadata and of the
citation structures for creating restricted initial working sets of patents and
for producing a final re-ranking regression model. As we exploit specific
metadata of the patent documents and the citation relations only at the
creation of initial working sets and during the final post ranking step, our
architecture remains generic and easy to extend.