111 research outputs found
Post Processing Wrapper Generated Tables For Labeling Anonymous Datasets
Many wrappers generate tables without column names for human consumption, because the meaning of the columns is apparent from the context and easy for humans to understand. In emerging applications, however, labels are needed for autonomous assignment and schema mapping, where a machine tries to understand the tables. Autonomous label assignment is critical in high-volume data processing that involves ad hoc mediation, extraction, and querying.
We propose an algorithm, Lads (Labeling Anonymous Datasets), which can holistically label or annotate tabular Web documents. The algorithm has been tested on anonymous datasets from a number of sites (e.g., music, movie, watch, political, automobile, and synthetic), obtained through different search engines such as Google, Yahoo, and MSN, yielding very promising results. The comparative probabilities of attributes being candidate labels are presented; the algorithm achieved as high as a 98% probability of assigning a good label to an anonymous attribute. To the best of our knowledge, this is the first approach of its kind for label assignment based on the recommendations of multiple search engines. We introduce a new paradigm, a Web search engine based annotator, which can holistically label tabular Web documents. We categorize columns into three types: disjoint set column (DSC), repeated prefix/suffix column (RPS), and numeric column (NUM). For labeling a DSC column, our method relies on hit counts from Web search engines (e.g., Google, Yahoo, and MSN). We formulate speculative queries to Web search engines and use the principle of disambiguation by maximal evidence to arrive at our solution. Our algorithm Lads is guaranteed to work for the disjoint set column.
Experimental results from a large number of sites in different domains, together with a subjective evaluation of our approach, show that the proposed algorithm Lads works fairly well; on this basis we claim that Lads is robust. To assign a label to a disjoint set column, we need a candidate set of labels (e.g., a label library), which can be collected on the fly from user SQL query variables as well as from Web form label tags. We classify a set of homogeneous anonymous datasets under meaningful labels and, at the same time, cluster those labels into a label library by learning the user's expectation and how a site materializes it. Previous work in this field relies on extraction ontologies; we eliminate the need for domain-specific ontologies because we can extract labels from the Web form. Our system is novel in that it accommodates labels from user query variables. We hypothesize that Lads will do a good job of autonomous label assignment. We bridge the gap between two orthogonal research directions: wrapper generation and ontology generation from Web sites (i.e., label extraction). We are not aware of any prior work that connects these two orthogonal research directions for value-added services such as online comparison shopping.
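The column typing and the hit-count labeling described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' Lads implementation: the function names, the regular expression for numeric cells, the minimum affix length, the quoted query format, and the `hit_count` callable (a stand-in for a real search engine lookup) are all assumptions made for illustration.

```python
import os
import re
from typing import Callable, Dict, List, Tuple

# Assumed pattern for a numeric cell (optional sign, digits, commas, decimals).
NUMERIC = re.compile(r"[-+]?\d[\d,]*(\.\d+)?")


def classify_column(values: List[str], min_affix: int = 2) -> str:
    """Return 'NUM', 'RPS', or 'DSC' for a column of cell values.

    The thresholds here are illustrative assumptions, not from the paper.
    """
    cells = [v.strip() for v in values if v.strip()]
    if cells and all(NUMERIC.fullmatch(c) for c in cells):
        return "NUM"
    if len(cells) > 1:
        # commonprefix compares strings character by character.
        prefix = os.path.commonprefix(cells)
        suffix = os.path.commonprefix([c[::-1] for c in cells])
        if len(prefix) >= min_affix or len(suffix) >= min_affix:
            return "RPS"
    return "DSC"


def label_dsc(
    values: List[str],
    candidates: List[str],
    hit_count: Callable[[str], int],
) -> Tuple[str, Dict[str, int]]:
    """Pick the candidate label with the most total evidence.

    One speculative query is issued per (label, value) pair; the label whose
    queries gather the largest summed hit count wins (maximal evidence).
    The query format is a hypothetical choice.
    """
    scores = {
        label: sum(hit_count(f'"{label}" "{value}"') for value in values)
        for label in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores
```

In use, `hit_count` would wrap whatever search service is available; passing it in as a parameter keeps the sketch testable with canned counts and makes it straightforward to sum evidence across several engines.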
Judging Analogous Data Search In Resultant Web Databases
With today's internet technologies, a huge amount of useful information resides in web databases (WDBs), but it is not retrieved effectively when users need it. Information retrieval is a major concern for users, and it depends on WDBs. The Web has become the accessible medium for many database applications, such as e-commerce and search media. These applications store information in huge databases that users access, query, and update through the Web. Web sites have their own interfaces and access forms for creating HTML pages on the fly. Web database technologies define the way that these forms can connect to and retrieve data from database servers. In this paper we present a novel approach for annotating web search results on search engines such as MSN. It automatically searches data using clustering techniques and classifies the retrieved data.
Knowledge Rich Natural Language Queries over Structured Biological Databases
Increasingly, keyword, natural language and NoSQL queries are being used for
information retrieval from traditional as well as non-traditional databases
such as web, document, image, GIS, legal, and health databases. While their
popularity is undeniable for obvious reasons, their engineering is far from
simple. For the most part, semantics and intent preserving mapping of a well
understood natural language query expressed over a structured database schema
to a structured query language is still a difficult task, and research to tame
the complexity is intense. In this paper, we propose a multi-level
knowledge-based middleware to facilitate such mappings that separate the
conceptual level from the physical level. We augment these multi-level
abstractions with a concept reasoner and a query strategy engine to dynamically
link arbitrary natural language querying to well defined structured queries. We
demonstrate the feasibility of our approach by presenting a Datalog based
prototype system, called BioSmart, that can compute responses to arbitrary
natural language queries over arbitrary databases once a syntactic
classification of the natural language query is made.
Twitter-demographer: a flow-based tool to enrich Twitter data
Twitter data have become essential to Natural Language Processing (NLP) and
social science research, driving various scientific discoveries in recent
years. However, the textual data alone are often not enough to conduct studies:
especially social scientists need more variables to perform their analysis and
control for various factors. How we augment this information, such as users'
location, age, or tweet sentiment, has ramifications for anonymity and
reproducibility, and requires dedicated effort. This paper describes
Twitter-Demographer, a simple, flow-based tool to enrich Twitter data with
additional information about tweets and users. Twitter-Demographer is aimed at
NLP practitioners and (computational) social scientists who want to enrich
their datasets with aggregated information, facilitating reproducibility, and
providing algorithmic privacy-by-design measures for pseudo-anonymity. We
discuss our design choices, inspired by the flow-based programming paradigm, to
use black-box components that can easily be chained together and extended. We
also analyze the ethical issues related to the use of this tool, and the
built-in measures to facilitate pseudo-anonymity.
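The idea of chaining black-box components, each enriching the records it receives, can be sketched in a few lines of Python. This is a generic illustration of the flow-based style and does not reproduce the Twitter-Demographer API: `chain`, `add_text_length`, and `pseudonymize` are hypothetical names, and salted SHA-256 hashing of the user identifier is a swapped-in stand-in for the tool's pseudo-anonymity measures, which the abstract does not specify.

```python
import hashlib
from typing import Callable, Dict, List

Record = Dict[str, object]
Component = Callable[[List[Record]], List[Record]]


def chain(*components: Component) -> Component:
    """Compose components so the output of each feeds the next."""
    def run(records: List[Record]) -> List[Record]:
        for component in components:
            records = component(records)
        return records
    return run


def add_text_length(records: List[Record]) -> List[Record]:
    """Hypothetical enrichment step: attach the length of each text."""
    return [{**r, "text_length": len(str(r["text"]))} for r in records]


def pseudonymize(salt: str) -> Component:
    """Build a step that replaces user_id with a salted SHA-256 digest.

    Assumption: one simple way to pseudonymize; not the tool's actual method.
    """
    def step(records: List[Record]) -> List[Record]:
        out = []
        for r in records:
            digest = hashlib.sha256(
                (salt + str(r["user_id"])).encode("utf-8")
            ).hexdigest()
            kept = {k: v for k, v in r.items() if k != "user_id"}
            out.append({**kept, "user_hash": digest})
        return out
    return step
```

A pipeline is then built as `chain(add_text_length, pseudonymize("some-salt"))` and applied to a list of records. Each step returns new dictionaries, so the input records are left unchanged and steps can be reordered or extended independently.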
A widget library for creating policy-aware semantic Web applications
Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from PDF version of thesis. Includes bibliographical references (p. 79-81). In order to truly reap the benefits of the Semantic Web, there must be adequate tools for writing Web applications that aggregate, view, and edit the widely varying data the Semantic Web makes available. As a step toward this goal, I introduce a JavaScript widget library for creating Web applications that can both read from and write to the Semantic Web. In addition to providing widgets that perform editing operations, the library supports access control rules for user-generated content using FOAF+SSL, a decentralized authentication technique, allowing users to independently manage the restrictions placed on their data. I demonstrate this functionality with two examples: an aggregator application for exploring information about musicians from multiple data stores, and a universal annotation widget that allows users to make public and private comments about any resource on the Semantic Web. By James Dylan Hollenbach. M.Eng.