Knowledge Discovery in Documents by Extracting Frequent Word Sequences
published or submitted for publication
siEDM: an efficient string index and search algorithm for edit distance with moves
Although several self-indexes for highly repetitive text collections exist,
developing an index and search algorithm with editing operations remains a
challenge. Edit distance with moves (EDM) is a string-to-string distance
measure that includes substring moves in addition to ordinal editing operations
to turn one string into another. Although the problem of computing EDM is
intractable, it has a wide range of potential applications, especially in
approximate string retrieval. Despite the importance of computing EDM, there
has been no efficient method for indexing and searching large text collections
based on the EDM measure. We propose the first algorithm, named string index
for edit distance with moves (siEDM), for indexing and searching strings with
EDM. The siEDM algorithm builds an index structure by leveraging the idea
behind edit-sensitive parsing (ESP), an efficient algorithm that approximates
EDM with guaranteed upper and lower bounds on the
exact EDM. siEDM efficiently prunes the search space for query strings,
which enables fast query searches with the same guarantee
as ESP. We experimentally tested the ability of siEDM to index and search
strings on benchmark datasets, and we showed siEDM's efficiency.
Comment: 23 pages
XML Schema Clustering with Semantic and Hierarchical Similarity Measures
With the growing popularity of XML as a data representation language, collections of XML data have grown rapidly in number. Methods are needed to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic meaning and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.
Harvesting Entities from the Web Using Unique Identifiers -- IBEX
In this paper we study the prevalence of unique entity identifiers on the
Web. These are, e.g., ISBNs (for books), GTINs (for commercial products), DOIs
(for documents), email addresses, and others. We show how these identifiers can
be harvested systematically from Web pages, and how they can be associated with
human-readable names for the entities at large scale.
Starting with a simple extraction of identifiers and names from Web pages, we
show how we can use the properties of unique identifiers to filter out noise
and clean up the extraction result on the entire corpus. The end result is a
database of millions of uniquely identified entities of different types, with
an accuracy of 73--96% and a very high coverage compared to existing knowledge
bases. We use this database to compute novel statistics on the presence of
products, people, and other entities on the Web.
Comment: 30 pages, 5 figures, 9 tables. Complete technical report for A.
Talaika, J. A. Biega, A. Amarilli, and F. M. Suchanek. IBEX: Harvesting
Entities from the Web Using Unique Identifiers. WebDB workshop, 201
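The IBEX abstract above mentions using the properties of unique identifiers to filter out noise, but it does not say which properties are used. As a minimal sketch of one such property, the standard check digit shared by ISBN-13 and GTIN-13 can reject digit strings that merely look like identifiers. The pattern `CANDIDATE_13` and the function names are hypothetical and are not taken from the paper.

```python
import re

# Hypothetical candidate pattern: 13 ASCII digits, optionally separated
# by single hyphens or spaces (e.g. "978-0-306-40615-7").
CANDIDATE_13 = re.compile(r"\b\d(?:[- ]?\d){12}\b", re.ASCII)


def is_valid_gtin13(digits: str) -> bool:
    """Check the mod-10 check digit used by ISBN-13 and GTIN-13."""
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    # Weights alternate 1, 3, 1, 3, ... from the leftmost digit;
    # the weighted sum of all 13 digits must be divisible by 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def harvest_gtin13(text: str) -> list[str]:
    """Return normalized 13-digit identifiers in text that pass the checksum."""
    found = []
    for match in CANDIDATE_13.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if is_valid_gtin13(digits):
            found.append(digits)
    return found
```

For example, `harvest_gtin13("ISBN 978-0-306-40615-7 or 978-0-306-40615-8")` keeps only the first candidate, because the second fails the checksum. A checksum filter of this kind removes only some of the noise: a random 13-digit string still passes about one time in ten.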