111 research outputs found

    Post Processing Wrapper Generated Tables For Labeling Anonymous Datasets

    Many wrappers generate tables without column names for human consumption, because the meaning of the columns is apparent from the context and easy for humans to understand. In emerging applications, however, labels are needed for autonomous assignment and schema mapping, where a machine tries to understand the tables. Autonomous label assignment is critical in volume data processing that involves ad hoc mediation, extraction and querying. We propose an algorithm, Lads (Labeling Anonymous Datasets), which can holistically label/annotate tabular Web documents. The algorithm has been tested on anonymous datasets from a number of sites (e.g., music, movie, watch, political, automobile, synthetic) obtained through different search engines such as Google, Yahoo and MSN, yielding very promising results. The comparative probabilities of attributes being candidate labels are presented; we achieved as high as a 98% probability of assigning a good label to an anonymous attribute. To the best of our knowledge, this is the first work of its kind on label assignment based on multiple search engines' recommendations. We introduce a new paradigm, the Web search engine based annotator. We categorize columns into three types: disjoint set column (DSC), repeated prefix/suffix column (RPS) and numeric column (NUM). For labeling a DSC, our method relies on hit counts from Web search engines (e.g., Google, Yahoo and MSN). We formulate speculative queries to the Web search engines and use the principle of disambiguation by maximal evidence to arrive at a solution. Our algorithm Lads is guaranteed to work for the disjoint set column. Experimental results from a large number of sites in different domains, together with a subjective evaluation of our approach, show that Lads works fairly well.
On this basis, we claim that Lads is robust. To assign a label to a disjoint set column, we need a candidate set of labels (e.g., a label library), which can be collected on the fly from the variables of user SQL queries as well as from Web form label tags. We classify a set of homogeneous anonymous datasets under a meaningful label and at the same time cluster those labels into a label library by learning the user's expectation and how a site materializes that expectation. Previous work in this field relies on extraction ontologies; we eliminate the need for domain-specific ontologies because we can extract labels from the Web form. Our system is novel in that it accommodates labels from user query variables. We hypothesize that Lads will do a good job of autonomous label assignment. We bridge the gap between two orthogonal research directions: wrapper generation and ontology generation from Web sites (i.e., label extraction). We are not aware of any prior work that connects these two orthogonal research directions for value-added services such as online comparison shopping.
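The column typing and hit-count labeling described in the abstract above can be illustrated with a minimal Python sketch. This is not the authors' Lads implementation: the function names, the `min_affix` threshold, the quoted query format, and the injected `hit_count` callable (a stand-in for a search engine's result count) are all assumptions made for illustration.

```python
import os
from typing import Callable, Dict, Sequence, Tuple


def classify_column(values: Sequence[str], min_affix: int = 2) -> str:
    """Return 'NUM', 'RPS', or 'DSC' for a column of cell strings.

    Hypothetical heuristic: min_affix is an assumed threshold, not a
    parameter taken from the paper.
    """
    cells = [v.strip() for v in values if v.strip()]
    if not cells:
        raise ValueError("column has no non-empty cells")

    def is_number(cell: str) -> bool:
        try:
            float(cell.replace(",", ""))
            return True
        except ValueError:
            return False

    # Numeric column: every cell parses as a number.
    if all(is_number(c) for c in cells):
        return "NUM"
    # Repeated prefix/suffix column: all cells share a leading or trailing string.
    if len(cells) > 1:
        prefix = os.path.commonprefix(cells)
        suffix = os.path.commonprefix([c[::-1] for c in cells])
        if len(prefix) >= min_affix or len(suffix) >= min_affix:
            return "RPS"
    # Otherwise treat the column as a disjoint set of values.
    return "DSC"


def pick_label(
    values: Sequence[str],
    candidates: Sequence[str],
    hit_count: Callable[[str], int],
) -> Tuple[str, Dict[str, int]]:
    """Pick the candidate label with the largest total hit count.

    One speculative query is issued per (label, value) pair; the query
    format is an assumption. hit_count is supplied by the caller.
    """
    if not candidates:
        raise ValueError("no candidate labels supplied")
    scores = {
        label: sum(hit_count(f'"{label}" "{value}"') for value in values)
        for label in candidates
    }
    # Disambiguation by maximal evidence: keep the best-supported label.
    best = max(scores, key=scores.get)
    return best, scores
```

Passing `hit_count` in as a callable keeps the sketch independent of any particular search engine API; in practice it would wrap whatever service reports result counts.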

    Judging Analogous Data Search In Resultant Web Databases

    With today's Internet technologies, a huge amount of useful information resides in Web databases (WDBs), but it is not retrieved effectively when users need it. Information retrieval is a major concern for users, and it depends on WDBs. The Web has become the accessible medium for many database applications, such as e-commerce and search media. These applications store information in huge databases that users access, query, and update through the Web. Web sites have their own interfaces and access forms for creating HTML pages on the fly. Web database technologies define the way that these forms can connect to and retrieve data from database servers. In this paper we present a novel approach for annotating Web search results on search engines such as MSN. It automatically searches data using clustering techniques and classifies the retrieved data for presentation.

    Knowledge Rich Natural Language Queries over Structured Biological Databases

    Increasingly, keyword, natural language and NoSQL queries are being used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity is undeniable for obvious reasons, their engineering is far from simple. For the most part, a semantics- and intent-preserving mapping of a well-understood natural language query, expressed over a structured database schema, to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings that separates the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language querying to well-defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog-based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made.

    Twitter-demographer: a flow-based tool to enrich Twitter data

    Twitter data have become essential to Natural Language Processing (NLP) and social science research, driving various scientific discoveries in recent years. However, the textual data alone are often not enough to conduct studies: especially social scientists need more variables to perform their analysis and control for various factors. How we augment this information, such as users' location, age, or tweet sentiment, has ramifications for anonymity and reproducibility, and requires dedicated effort. This paper describes Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with additional information about tweets and users. Twitter-Demographer is aimed at NLP practitioners and (computational) social scientists who want to enrich their datasets with aggregated information, facilitating reproducibility, and providing algorithmic privacy-by-design measures for pseudo-anonymity. We discuss our design choices, inspired by the flow-based programming paradigm, to use black-box components that can easily be chained together and extended. We also analyze the ethical issues related to the use of this tool, and the built-in measures to facilitate pseudo-anonymity

    Creating ontology-based metadata by annotation for the semantic web


    Semantic Web methods for knowledge management [online]


    A widget library for creating policy-aware semantic Web applications

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 79-81). In order to truly reap the benefits of the Semantic Web, there must be adequate tools for writing Web applications that aggregate, view, and edit the widely varying data the Semantic Web makes available. As a step toward this goal, I introduce a JavaScript widget library for creating Web applications that can both read from and write to the Semantic Web. In addition to providing widgets that perform editing operations, access control rules for user-generated content are supported using FOAF+SSL, a decentralized authentication technique, allowing users to independently manage the restrictions placed on their data. I demonstrate this functionality with two examples: an aggregator application for exploring information about musicians from multiple data stores, and a universal annotation widget that allows users to make public and private comments about any resource on the Semantic Web. By James Dylan Hollenbach. M.Eng.