298 research outputs found
An LSH Index for Computing Kendall's Tau over Top-k Lists
We consider the problem of similarity search within a set of top-k lists
under the Kendall's Tau distance function. This distance describes how related
two rankings are in terms of concordantly and discordantly ordered items. As
top-k lists are usually very short compared to the global domain of possible
items to be ranked, creating an inverted index to look up overlapping lists is
possible but does not capture tight enough the similarity measure. In this
work, we investigate locality sensitive hashing schemes for the Kendall's Tau
distance and evaluate the proposed methods using two real-world datasets.Comment: 6 pages, 8 subfigures, presented in Seventeenth International
Workshop on the Web and Databases (WebDB 2014) co-located with ACM SIGMOD201
EquiX---A Search and Query Language for XML
EquiX is a search language for XML that combines the power of querying with
the simplicity of searching. Requirements for such languages are discussed and
it is shown that EquiX meets the necessary criteria. Both a graphical abstract
syntax and a formal concrete syntax are presented for EquiX queries. In
addition, the semantics is defined and an evaluation algorithm is presented.
The evaluation algorithm is polynomial under combined complexity.
EquiX combines pattern matching, quantification and logical expressions to
query both the data and meta-data of XML documents. The result of a query in
EquiX is a set of XML documents. A DTD describing the result documents is
derived automatically from the query.Comment: technical report of Hebrew University Jerusalem Israe
Taxonomy and clustering in collaborative systems: the case of the on-line encyclopedia Wikipedia
In this paper we investigate the nature and structure of the relation between
imposed classifications and real clustering in a particular case of a
scale-free network given by the on-line encyclopedia Wikipedia. We find a
statistical similarity in the distributions of community sizes both by using
the top-down approach of the categories division present in the archive and in
the bottom-up procedure of community detection given by an algorithm based on
the spectral properties of the graph. Regardless the statistically similar
behaviour the two methods provide a rather different division of the articles,
thereby signaling that the nature and presence of power laws is a general
feature for these systems and cannot be used as a benchmark to evaluate the
suitability of a clustering method.Comment: 5 pages, 3 figures, epl2 styl
A Look Back on the XML Benchmark Project
The XML Benchmark Project was started to provide a framework for evaluating the interplay of XML technologies and Database Management Systems. The benchmark lays emphasis on engineering aspects as well as on performance of the query processor. In this chapter the authors present a quick overview of the benchmark and point at some of the experience they gathered during the design of the benchmark and while running it on a variety of platforms. Since the benchmark was designed early in the evolution of XML, our experiences also reflect how the perception of XML changed during the three years that have passed since we started working on the subject. The chapter comprises an overview of the benchmark as well as discussions of some lessons learned
ENTITY EXTRACTION USING STATISTICAL METHODS USING INTERACTIVE KNOWLEDGE MINING FRAMEWORK
There are various kinds of valuable semantic information about real-world entities embedded in web pages and databases. Extracting and integrating these entity information from the Web is of great significance. Comparing to traditional information extraction problems, web entity extraction needs to solve several new challenges to fully take advantage of the unique characteristic of the Web. In this paper, we introduce our recent work on statistical extraction of structured entities, named entities, entity facts and relations from Web. We also briefly introduce iKnoweb, an interactive knowledge mining framework for entity information integration. We will use two novel web applications, Microsoft Academic Search (aka Libra) and EntityCube, as working examples
- …