Keyword search in graphs, relational databases and social networks
Keyword search, a well-known mechanism for retrieving relevant information from a set of documents, has recently been studied for extracting information from structured data (e.g., relational databases and XML documents). It offers an alternative to query languages (e.g., SQL) for exploring databases, one that is effective for lay users who may not be familiar with the database schema or the query language. This dissertation addresses several issues in keyword search over structured data: it proposes novel solutions to existing problems in keyword search in graphs and relational databases, and studies a related problem, team formation in social networks. The dissertation consists of four parts.
The first part addresses keyword search over a graph, which finds a substructure of the graph containing all or some of the query keywords. Current methods for keyword search over graphs may produce answers in which some content nodes (i.e., nodes that contain input keywords) are not very close to each other. In addition, current methods explore both content and non-content nodes while searching for the result, and are thus both time- and memory-consuming for large graphs. To address these problems, we propose algorithms for finding r-cliques in graphs. An r-clique is a group of content nodes that covers all the input keywords and in which the distance between each pair of nodes is at most r. Two approximation algorithms that produce r-cliques with a bounded approximation ratio in polynomial delay are proposed.
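The membership test implied by the r-clique definition can be sketched as follows. This is an illustrative check only, with invented graph data and function names; the dissertation's algorithms additionally handle efficient enumeration with polynomial delay rather than testing candidate sets one by one.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop distances from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_r_clique(adj, node_keywords, nodes, query, r):
    """True if `nodes` jointly cover every query keyword and every
    pair of nodes lies within distance r of each other."""
    covered = set()
    for n in nodes:
        covered |= node_keywords[n] & query
    if covered != set(query):
        return False  # some keyword is not covered
    for u in nodes:
        dist = bfs_distances(adj, u)
        for v in nodes:
            if v != u and dist.get(v, float("inf")) > r:
                return False  # a pair of content nodes is too far apart
    return True

# Example: 'a' and 'c' jointly cover {x, y} and are two hops apart
adj = {'a': ['b'], 'b': ['a', 'c'], 'c': ['b']}
kw = {'a': {'x'}, 'b': set(), 'c': {'y'}}
```

For instance, `{'a', 'c'}` above forms an r-clique for r = 2 but not for r = 1, since the two content nodes are two hops apart.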
In the second part, the problem of duplication-free and minimal keyword search in graphs is studied. Current methods for keyword search in graphs may produce duplicate answers that contain the same set of content nodes. In addition, an answer found by these methods may not be minimal in the sense that some of the nodes in the answer may contain query keywords that are all covered by other nodes in the answer. Removing these nodes does not change the coverage of the answer but can make the answer more compact. We define the problem of finding duplication-free and minimal answers, and propose algorithms for finding such answers efficiently.
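The minimality condition above can be illustrated with a naive post-processing pass: repeatedly drop any node whose query keywords are entirely covered by the remaining nodes. The data and names are invented, and the dissertation's algorithms avoid producing duplicate and non-minimal answers during the search itself rather than filtering afterwards.

```python
def minimize_answer(node_keywords, nodes, query):
    """Greedily drop nodes whose query keywords are fully covered by
    the other nodes in the answer; the result depends on removal order."""
    result = list(nodes)
    i = 0
    while i < len(result):
        n = result[i]
        others = set().union(
            *(node_keywords[m] for m in result if m != n)
        ) & query
        if node_keywords[n] & query <= others:
            result.pop(i)  # redundant node: coverage is unchanged without it
            i = 0          # restart, since remaining contributions changed
        else:
            i += 1
    return result
```

For example, if one node contains both query keywords but two other nodes in the answer each contain one of them, the first node is redundant and is removed.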
Meaningful keyword search in relational databases is the subject of the third part of this dissertation. Keyword search over relational databases returns a join tree spanning tuples containing the query keywords. As many answers of varying quality can be found, and the user is often only interested in seeing the top-k answers, how to gauge the relevance of answers in order to rank them is of paramount importance. This becomes more pertinent for databases with large and complex schemas. We focus on the relevance of join trees as the fundamental means to rank the answers. We devise means to measure the relevance of relations and foreign keys in the schema over the information content of the database.
The problem of keyword search over graph data is similar to the problem of team formation in social networks. In this setting, keywords represent skills and the nodes in a graph represent the experts that possess skills. Given an expert network, in which a node represents an expert with a cost for using the expert's services and an edge represents the communication cost between the two corresponding experts, we tackle the problem of finding a team of experts that covers a set of required skills while also minimizing both the communication cost and the personnel cost of the team. We propose two types of approximation algorithms to solve this bi-criteria problem in the fourth part of this dissertation.
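If the communication-cost criterion is set aside, the skill-coverage half of this problem reduces to weighted set cover, for which the classic greedy heuristic gives a logarithmic approximation. The sketch below shows that simplified baseline only, with invented expert data; it is not the dissertation's bi-criteria algorithm, which must also bound the communication cost over the network's edges.

```python
def greedy_team(experts, required):
    """Set-cover-style greedy: repeatedly pick the expert with the best
    ratio of newly covered skills to personnel cost. Communication cost
    between experts is deliberately ignored in this baseline."""
    uncovered = set(required)
    team = []
    while uncovered:
        best = max(
            (e for e in experts if e["skills"] & uncovered),
            key=lambda e: len(e["skills"] & uncovered) / e["cost"],
            default=None,
        )
        if best is None:
            return None  # the required skills cannot be covered at all
        team.append(best["name"])
        uncovered -= best["skills"]
    return team

# Hypothetical expert network (personnel costs only)
experts = [
    {"name": "A", "skills": {"java", "sql"}, "cost": 2},
    {"name": "B", "skills": {"java"}, "cost": 1},
    {"name": "C", "skills": {"ml"}, "cost": 1},
]
```

A bi-criteria algorithm would additionally charge each candidate for the communication cost it adds to the partially built team, which is what makes the full problem substantially harder.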
Four Lessons in Versatility or How Query Languages Adapt to the Web
Exposing not only human-centered information but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation, as application builders have to cope not with one Web stack (e.g., XML technology) but with several, each of considerable complexity. With Xcerpt we have developed a rule- and pattern-based query language that aims to shield application builders from much of this complexity: in a single query language, XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C's GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply to querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desired. (4) Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear-time and linear-space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards more convenient, yet highly efficient, data access in a "Web of Data".
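The interval-labeling idea can be illustrated for the tree-shaped case with a standard pre/post-order DFS labeling, under which ancestor tests become constant-time interval-containment checks. This sketch covers only trees and uses invented names; Xcerpt's evaluation extends the technique to many graph shapes, which is where the cited linear-time bounds for RDF come from.

```python
def label_intervals(tree, root):
    """Assign each node a half-open interval [start, end) by DFS so that
    u is an ancestor-or-self of v iff start[u] <= start[v] < end[u]."""
    start, end = {}, {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, finished = stack.pop()
        if finished:
            end[node] = counter  # all descendants have been numbered
        else:
            start[node] = counter
            counter += 1
            stack.append((node, True))
            # reversed() keeps children in left-to-right preorder
            for child in reversed(tree.get(node, [])):
                stack.append((child, False))
    return start, end

def is_ancestor(start, end, u, v):
    """Constant-time ancestor-or-self test via interval containment."""
    return start[u] <= start[v] < end[u]
```

The payoff is that structural query steps like "descendant of" need no graph traversal at query time: one comparison per candidate node suffices once the labeling is precomputed.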
Beyond Similar Code: Leveraging Social Coding Websites
Programmers often write code similar to existing code written elsewhere. Code search tools can help developers find similar solutions and identify possible improvements. For code search tools, good search results rely on valid data collection. Social coding websites, such as the Question & Answer forum Stack Overflow (SO) and the project repository GitHub, are popular destinations when programmers look for how to achieve certain programming tasks. Over the years, SO and GitHub have accumulated an enormous knowledge base of, and around, code. Since these software artifacts are publicly available, it is possible to leverage them in code search tools. This dissertation explores the opportunities of leveraging software artifacts from social coding websites in searching for not just similar, but related, code. Programmers query SO and GitHub extensively to search for suitable code for reuse; however, not much is known about the usability or quality of the available code from each website. This dissertation first investigates under what circumstances the software artifacts found in social coding websites can be leveraged for purposes other than their immediate use by developers. It points out a number of problems that need to be addressed before those artifacts can be leveraged for code search and development tools. Specifically, triviality, fragility, and duplication dominate these artifacts. However, when these problems are addressed, there is still a considerable amount of good-quality artifacts that can be leveraged. SO and GitHub are not merely two separate data resources; together, they belong to a larger software development process: the same users who rely on the facilities of GitHub often seek support on SO for their problems, and return to GitHub to apply the knowledge acquired.
This dissertation further studies the crossover of software artifacts between SO and GitHub, and categorizes the adaptations from an SO code snippet to its GitHub counterparts. Existing search tools only recommend other code locations that are syntactically or semantically similar to the given code, but do not reason about other kinds of relevant code that a developer should also pay attention to, e.g., auxiliary code to accomplish a complete task. With the good-quality software artifacts and the crossover between the two systems available, this dissertation presents two approaches that leverage these artifacts in searching for related code. Aroma indexes GitHub projects, takes a partial code snippet as input, searches the corpus for methods containing the partial code snippet, and clusters and intersects the results of the search to produce recommendations. Aroma is evaluated on randomly selected queries created from the GitHub corpus, as well as queries derived from SO code snippets. It recommends related code for error checking and handling, object configuration, etc. Furthermore, a user study is conducted in which industrial developers are asked to complete programming tasks using Aroma and provide feedback. The results indicate that Aroma is capable of retrieving and recommending relevant code snippets efficiently. CodeAid uses the crossover between SO and GitHub to recommend related code outside of a method body. For each SO snippet as a query, CodeAid retrieves the co-occurring code fragments for its GitHub counterparts and clusters them to recommend common ones. 74% of the common co-occurring code fragments represent related functionality that should be included in code search results. Three major types of relevancy are identified: complementary, supplementary, and alternative methods.
From unstructured HTML to structured XML: how XML supports financial knowledge management on internet.
by Yuen Lok-tin. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 88-95). Abstracts in English and Chinese.
Contents:
1. Introduction: Background; Objectives; Organization
2. Literature Review & Theoretical Foundation: Data, Information and Knowledge; Knowledge Management; Information Transparency and Efficiency (Transparency; Efficiency); Extensible Markup Language (XML)
3. Digital Financial Information and Issues: Managing Financial Information on the Internet; Existing Electronic Financial Filing Systems; Financial Document Disclosure Model; Interaction Between Information Producers and Consumers; Gluing All Together
4. Ideal Electronic Financial Disclosure System: Structure and Representation of Knowledge; Content Creation
5. Proposed Approach: Preliminary XML Data Dictionary; Creation of XML Tags (Statistical Information Retrieval; Accounting and Auditing Practice; Investors' Feedback); Value-Added Services
6. Design and Development of ELFFS-XML: Stages of ELFFS-XML (Information Creation; Information Collection/Storage; Knowledge Generation; Knowledge Dissemination/Presentation; Feedback); Components of ELFFS-XML (Data Source Abstraction Layer; Storage Abstraction Layer; Logic Layer; Presentation Layer)
7. Evaluating ELFFS-XML: Comparison with Other Financial Information Disclosure Systems; Users' Evaluation; Systems Efficiency; XML Tag Generation Approach Performance Evaluation
8. Conclusion and Future Research
Appendix I: Survey on Investment Pattern; Appendix II: Core ELFFS-XML DTD; Appendix III: Performance-Related XML Tags; Bibliography
Survey over Existing Query and Transformation Languages
A widely acknowledged obstacle to realizing the vision of the Semantic Web is the inability of many current Semantic Web approaches to cope with data available in such diverging representation formalisms as XML, RDF, or Topic Maps. A common query language is the first step to allow transparent access to data in any of these formats. To further the understanding of the requirements and approaches proposed for query languages in the conventional as well as the Semantic Web, this report surveys a large number of query languages for accessing XML, RDF, or Topic Maps. This is the first systematic survey to consider query languages from all these areas. From the detailed survey of these query languages, a common classification scheme is derived that is useful for understanding and differentiating languages within and among all three areas.