
    Democratic community-based search with XML full-text queries

    As the web evolves, it is becoming easier to form online communities based on shared interests, and to create and publish data on a wide variety of topics. With this democratization of information creation, it is natural to query, in an ad-hoc and expressive fashion, the global collection that is the union of all local data collections within the community. To publish and locate documents of interest while fully delivering on the promise of free data exchange, any community-supporting infrastructure must preserve the privacy of the association between content providers and potentially sensitive information. This privacy-preserving publishing requirement prevents censorship, harassment, and discrimination of users by third parties. It also precludes some obvious approaches that reuse and build on existing centralized technologies, such as search engines and hosted online communities. This dissertation facilitates the democratization of data publishing and efficient search with powerful full-text queries over the community's global collection by means of a novel distributed framework that disseminates queries in online communities. We address two challenging issues that arise in this context: the design of distributed access methods to publishers, and the local evaluation of expressive queries (i.e., XML full-text) at each publisher. First, given the virtual nature of the global data collection, we study the problem of efficiently discovering publishers in the community that contain documents matching a user query; we call such peers relevant publishers. We propose a novel distributed infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion, under data-location anonymity constraints.
    That is, the query-forwarding infrastructure prevents leaking information about which publishers are capable of answering a given query. Second, once queries reach relevant publishers, we study how they efficiently process the incoming queries over their local repositories. Given that the commonly used data model for information exchange on the Web is semi-structured (e.g., XML), we propose algorithms for the evaluation and optimization of expressive XML queries that integrate structured and full-text search, including the W3C XQuery Full-Text standard.
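    The dissemination scheme described above can be caricatured in a few lines. The sketch below is an illustration of the general idea only, not the dissertation's actual protocol: queries are flooded to every peer, each publisher evaluates the query locally and answers at its own discretion, and forwarding happens regardless of whether a peer holds a match, so the act of forwarding reveals nothing about who can answer. The class and method names are hypothetical.

    ```python
    class Publisher:
        """A peer owning a local document collection (illustrative sketch).

        Every peer forwards every query to its neighbors, whether or not it
        has a local match, so observing the forwarding traffic does not
        reveal which publishers hold matching data.
        """

        def __init__(self, name, docs):
            self.name = name
            self.docs = docs          # data resides only with its publisher
            self.neighbors = []

        def handle(self, query, seen, results):
            if self.name in seen:     # avoid re-processing in the flood
                return
            seen.add(self.name)
            # Local evaluation; the publisher answers at its own discretion.
            matches = [d for d in self.docs if query in d]
            results.extend(matches)
            # Forward regardless of a local hit, preserving data-location
            # anonymity: forwarding behavior is independent of content.
            for n in self.neighbors:
                n.handle(query, seen, results)
    ```

    For example, three publishers in a ring all see the query "xml", and the two holding matching documents contribute answers, while the flood pattern itself is identical for all peers.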

    Censorship-resistant Publishing


    Rewriting Nested XML Queries Using Nested Views


    WIKIANALYTICS: Ad-hoc Querying of Highly Heterogeneous Structured Data

    Abstract — Searching and extracting meaningful information from highly heterogeneous datasets is a hot topic that has received a lot of attention. However, existing solutions rely either on rigid, complex query languages (e.g., SQL, XQuery/XPath), which are hard to use without full schema knowledge and an expert user and which require up-front data integration, or, at the other extreme, on keyword search queries over relational databases [3], [1], [10], [9], [2], [11] as well as over semistructured data [6], [12], [17], [15], which are too imprecise to specify the user's intent exactly [16]. To address these limitations, we propose an alternative search paradigm that derives tables of precise and complete results from a very sparse set of heterogeneous records. Our approach allows users to disambiguate search results by navigating along conceptual dimensions that describe the records. To this end, we cluster documents based on the fields and values that contain the query keywords, and we build a universal navigational lattice (UNL) over all such discovered clusters. Conceptually, the UNL encodes all possible ways to group the documents in the data corpus based on where the keywords hit. We describe WIKIANALYTICS, a system that facilitates data extraction from the Wikipedia infobox collection. WIKIANALYTICS provides a dynamic and intuitive interface that lets the average user explore the search results and construct homogeneous structured tables, which can be further queried and mashed up (e.g., filtered and aggregated) using conventional tools.

    WikiAnalytics: Disambiguation of Keyword Search Results on Highly Heterogeneous Structured Data

    The Wikipedia infobox collection is an example of a seemingly structured, yet extraordinarily heterogeneous dataset, where any given record has only a tiny fraction of all possible fields. Such data cannot be queried by traditional means without a massive a priori integration effort, since even for a simple request the result values span many record types and fields. On the other hand, solutions based on keyword search are too imprecise to capture the user's intent. To address these limitations, we propose WIKIANALYTICS, a system that utilizes a novel search paradigm to derive tables of precise and complete results from Wikipedia infobox records. The user starts with a keyword search query that finds a superset of the result records, and then browses clusters of records, deciding which are and which are not relevant. WIKIANALYTICS uses three categories of clustering features, based on the record types, fields, and values that matched the query keywords, respectively. Since the system cannot predict which combination of features will be important to the user, it efficiently generates all possible clusters of records by all sets of features. We utilize a novel data structure, the universal navigational lattice (UNL), that compactly encodes all possible clusters. WIKIANALYTICS provides a dynamic and intuitive interface that lets the user explore the UNL and construct homogeneous structured tables, which can be further queried and aggregated using conventional tools.
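    The core clustering step behind the UNL can be sketched concretely. The snippet below is a rough illustration under simplifying assumptions (records as flat dictionaries, field-level features only; the function name is hypothetical and the real UNL additionally encodes the subset lattice over these feature sets): records are grouped by the exact set of fields in which the query keywords hit, so each distinct "where it matched" signature becomes one cluster the user can accept or reject.

    ```python
    from collections import defaultdict

    def cluster_by_hits(records, keywords):
        """Group records by the set of fields whose values contain a keyword.

        Illustrative sketch: each distinct feature set (here, a frozenset of
        matching field names) identifies one cluster of records.
        """
        clusters = defaultdict(list)
        for rec in records:
            hits = frozenset(
                field for field, value in rec.items()
                if any(kw.lower() in str(value).lower() for kw in keywords)
            )
            clusters[hits].append(rec)
        return dict(clusters)

    # Hypothetical infobox-like records: the keyword "mathematics" hits
    # different fields in different records, yielding distinct clusters.
    records = [
        {"name": "Ada Lovelace", "field": "mathematics", "born": "1815"},
        {"name": "Alan Turing", "field": "mathematics", "known_for": "computing"},
        {"title": "Mathematics", "type": "discipline"},
    ]
    clusters = cluster_by_hits(records, ["mathematics"])
    # Two clusters: records matching in "field" vs. the one matching in "title"
    ```

    Grouping by the full feature set, rather than by a single chosen field, is what lets the interface offer every possible way of slicing the results without predicting in advance which dimension matters to the user.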