    Encoding databases satisfying a given set of dependencies

    Consider a relation schema with a set of dependency constraints. A fundamental question is the minimum space in which the possible instances of the schema can be "stored". We study the following model: encode the instances by a function that maps the set of possible instances into the set of words of a given length over the binary alphabet in a decodable way. The problem is to find the minimum length needed; this minimum is called the information content of the database. We investigate several cases where the set of dependency constraints consists of relatively simple sets of functional or multivalued dependencies. We also consider the following natural extension: is it possible to encode the instances in such a way that small changes in the instance cause only a small change in the code? © 2012 Springer-Verlag
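    As a minimal illustration of the quantity under study (notation ours, assuming the set of instances satisfying the dependencies is finite): a decodable encoding is an injective map into binary words of a fixed length, so the shortest usable length is determined by the number of admissible instances.

        % Sketch, assuming I(R, \Sigma) -- the set of instances of schema R
        % satisfying the dependency set \Sigma -- is finite.
        \[
          \mathrm{info}(R,\Sigma)
            = \min\bigl\{ L : \exists\, \text{injective } c\colon I(R,\Sigma) \to \{0,1\}^L \bigr\}
            = \bigl\lceil \log_2 |I(R,\Sigma)| \bigr\rceil .
        \]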

    Querying XML Documents Made Easy: Nearest Concept Queries

    Due to the ubiquity and popularity of XML, users often are in the following situation: they want to query XML documents which contain potentially interesting information but they are unaware of the mark-up structure that is used. For example, it is easy to guess the contents of an XML bibliography file, whereas the mark-up depends on the methodological, cultural and personal background of the author(s). Nonetheless, it is this hierarchical structure that forms the basis of XML query languages. In this paper we exploit the tree structure of XML documents to equip users with a powerful tool, the meet operator, that lets them query databases with whose content they are familiar, but without requiring knowledge of tags and hierarchies. Our approach is based on computing the lowest common ancestor of nodes in the XML syntax tree: e.g., given two strings, we are looking for nodes whose offspring contains these two strings. The novelty of this approach is that the result type is unknown at query formulation time and dependent on the database instance. If the two strings are an author's name and a year, mainly publications of the author in this year are returned. If the two strings are numbers, the result mostly consists of publications that have the numbers as year or page numbers. Because the result type of a query is not specified by the user, we refer to the lowest common ancestor as the nearest concept. We also present a running example taken from the bibliography domain, and demonstrate that the operator can be implemented efficiently.
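    The meet operator described above reduces to a lowest-common-ancestor computation over the XML syntax tree. A minimal sketch, assuming a simple in-memory node representation (class and field names are ours, not the paper's): walk from one matched node to the root, record its ancestors, then return the deepest ancestor shared with the other node.

        import java.util.*;

        // Sketch of the meet (nearest concept) of two nodes: their lowest
        // common ancestor in the XML tree. Node representation is assumed.
        class XmlNode {
            XmlNode parent;                      // null at the document root
            String tag;                          // element name
            String text;                         // text content of this node
            List<XmlNode> children = new ArrayList<>();
        }

        class Meet {
            // Lowest common ancestor via root paths.
            static XmlNode meet(XmlNode a, XmlNode b) {
                Set<XmlNode> ancestorsOfA = new HashSet<>();
                for (XmlNode n = a; n != null; n = n.parent) ancestorsOfA.add(n);
                for (XmlNode n = b; n != null; n = n.parent)
                    if (ancestorsOfA.contains(n)) return n;  // deepest node on both paths
                return null;                                 // nodes from different trees
            }
        }

    Given the two query strings, one would first locate the nodes whose text contains each string and then return the meets of the matching pairs; the type of the result node is whatever concept happens to enclose both matches in that document.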

    Web Mail Information Extraction

    This project is conducted to deliver the background of study, problem statements, objectives, scope, literature review, the development methodology chosen, results and discussion, conclusion, recommendations, and the references used throughout its completion. The objective of this project is to extract relevant and useful information from Google Mail (GMail) by performing Information Extraction (IE) using the Java programming language. After several rounds of testing, the system developed is able to successfully extract relevant and useful information from a GMail account, with emails drawn from different folders such as All Mail, Inbox, Drafts, Starred, Sent Mail, Spam and Trash. The focus is to extract email information such as the sender, recipient, subject and content. The extracted information is presented in two mediums, as a text file or stored inside a database, in order to better suit users who come from different backgrounds and have different needs.
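    A minimal sketch of this kind of extraction in Java, assuming IMAP access through the JavaMail API (host, credentials and the folder name below are placeholders; the report does not state which access mechanism was used):

        import javax.mail.*;
        import java.util.Arrays;
        import java.util.Properties;

        // Sketch: connect to a GMail account over IMAPS and print the sender,
        // recipients and subject of each message in one folder.
        public class GmailExtractor {
            public static void main(String[] args) throws Exception {
                Properties props = new Properties();
                props.put("mail.store.protocol", "imaps");
                Session session = Session.getInstance(props);
                Store store = session.getStore("imaps");
                store.connect("imap.gmail.com", "user@gmail.com", "app-password"); // placeholders

                Folder folder = store.getFolder("INBOX");   // e.g. "[Gmail]/Sent Mail"
                folder.open(Folder.READ_ONLY);
                for (Message msg : folder.getMessages()) {
                    System.out.println("From:    " + Arrays.toString(msg.getFrom()));
                    System.out.println("To:      " + Arrays.toString(msg.getAllRecipients()));
                    System.out.println("Subject: " + msg.getSubject());
                    // msg.getContent() yields the body (plain text or multipart)
                }
                folder.close(false);
                store.close();
            }
        }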

    RecipeCrawler: Collecting Recipe Data from WWW Incrementally

    The WWW has posed itself as the largest data repository ever available in the history of humankind. Utilizing the Internet as a data source seems natural, and many efforts have been made. In this paper we focus on establishing a robust system to collect structured recipe data from the Web incrementally, which, as we believe, is a critical step towards practical, continuous, reliable web data extraction systems and therefore towards utilizing the WWW as a data source for various database applications. The reasons for advocating such an incremental approach are two-fold: (1) it is impractical to crawl all the recipe pages from relevant web sites, as the Web is highly dynamic; (2) it is almost impossible to induce a general wrapper for future extraction from the initial batch of recipe web pages. In this paper, we describe such a system, called RecipeCrawler, which aims at incrementally collecting recipe data from the WWW. General issues in establishing an incremental data extraction system are considered and techniques are applied to recipe data collection from the Web. Our RecipeCrawler is actually used as the backend of a fully-fledged multimedia recipe database system being developed jointly by City University of Hong Kong and Renmin University of China.
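    A minimal sketch of the incremental collection loop argued for above (the fetcher and wrapper interfaces, and the change check by content hash, are our assumptions, not the paper's design): pages that have already been extracted and are unchanged are skipped, and only pages the current wrapper can handle are turned into structured records.

        import java.util.*;

        // Sketch of incremental recipe collection: remember what has been seen,
        // re-extract only new or changed pages, and leave unhandled pages for
        // later wrapper refinement. Interfaces are illustrative assumptions.
        interface Wrapper {
            boolean canHandle(String html);
            Map<String, String> extract(String html);   // e.g. title, ingredients, steps
        }
        interface Fetcher {
            List<String> listRecipeUrls(String site);
            String fetch(String url);
        }

        class IncrementalRecipeCrawler {
            private final Map<String, Integer> seen = new HashMap<>(); // url -> content hash
            private final Wrapper wrapper;
            private final Fetcher fetcher;

            IncrementalRecipeCrawler(Wrapper w, Fetcher f) { wrapper = w; fetcher = f; }

            void crawlOnce(List<String> sites, List<Map<String, String>> store) {
                for (String site : sites) {
                    for (String url : fetcher.listRecipeUrls(site)) {
                        String html = fetcher.fetch(url);
                        Integer hash = html.hashCode();
                        if (hash.equals(seen.get(url))) continue;       // unchanged page
                        if (wrapper.canHandle(html)) {
                            store.add(wrapper.extract(html));           // structured record
                            seen.put(url, hash);
                        }
                        // otherwise: queue the page for wrapper refinement (not shown)
                    }
                }
            }
        }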

    Towards Comparative Web Content Mining using Object Oriented Model

    Web content data are heterogeneous in nature, usually composed of different types of content and data structures. Thus, extraction and mining of web content data is a challenging branch of data mining. Traditional web content extraction and mining techniques fall into three categories: programming-language-based wrappers, wrapper (data extraction program) induction techniques, and automatic wrapper generation techniques. The first category constructs data extraction systems by providing specialized pattern specification languages; the second is a supervised learning approach, which learns data extraction rules; and the third is a fully automatic extraction process. All these data extraction techniques rely on web document presentation structures, which require complicated matching and tree alignment algorithms and routine maintenance, are hard to unify for the vast variety of websites, and fail to capture heterogeneous data together. To capture a greater diversity of web documents, a feasible implementation of an automatic data extraction technique based on an object-oriented data model, OOWeb, was proposed in Annoni and Ezeife (2009). This thesis implements, materializes and extends that structured automatic data extraction technique. We developed a system (called WebOMiner) for extraction and mining of structured web content based on the object-oriented data model. The thesis extends the extraction algorithms proposed by Annoni and Ezeife (2009) and develops an automata-based automatic wrapper generation algorithm for extraction and mining of structured web content data. Our algorithm identifies data blocks from a flat array data structure and generates Non-Deterministic Finite Automaton (NFA) patterns for the different types of content data to be extracted. The objective of this thesis is to extract and mine heterogeneous web content and relieve the hard effort of matching, tree alignment and routine maintenance. Experimental results show that our system is highly effective and performs the mining task with 100% precision and a 96.22% recall value.
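    A minimal sketch of the automaton idea described above (the content-type codes and the block pattern are illustrative assumptions, not the thesis's actual NFA): each token in the flat data-block array is labelled with a content type, and a nondeterministic pattern over the type sequence, written here as an equivalent regular expression, decides whether the block is a record of a given kind.

        import java.util.*;
        import java.util.regex.*;

        // Sketch: label tokens with content-type codes, then match the type
        // sequence of a candidate block against an NFA-equivalent pattern.
        class BlockMatcher {
            // Map a raw token to a one-character content-type code (assumed types).
            static char typeOf(String token) {
                if (token.matches("\\$?\\d+(\\.\\d{2})?")) return 'P';  // price-like
                if (token.matches("<img[^>]*>"))           return 'I';  // image tag
                if (token.matches("https?://\\S+"))        return 'L';  // link
                return 'T';                                             // free text, e.g. a title
            }

            // Example block pattern: optional image, one or more text tokens, then a price.
            static final Pattern PRODUCT_BLOCK = Pattern.compile("I?T+P");

            static boolean isProductBlock(List<String> block) {
                StringBuilder types = new StringBuilder();
                for (String token : block) types.append(typeOf(token));
                return PRODUCT_BLOCK.matcher(types).matches();
            }
        }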

    Focused image search in the social Web.

    Recently, social multimedia-sharing websites, which allow users to upload, annotate, and share online photo or video collections, have become increasingly popular. The user tags or annotations constitute the new multimedia metadata. We present an image search system that exploits both the textual and the visual information of images. First, we use focused crawling and DOM-tree-based web data extraction methods to extract image textual features from social networking image collections. Second, we propose the concept of visual words to handle an image's visual content for fast indexing and searching. We also develop several user-friendly search options to allow users to query the index using words and image feature descriptions (visual words). The developed image search system tries to bridge the gap between the scalable industrial image search engines, which are based on keyword search, and the slower content-based image retrieval systems developed mostly in the academic field and designed to search on image content only. We have implemented a working prototype by crawling and indexing over 16,056 images from flickr.com, one of the most popular image sharing websites. Our experimental results on the working prototype confirm the efficiency and effectiveness of the methods that we proposed.
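    A minimal sketch of the visual-words idea (the descriptor format, the offline-trained codebook and the index layout are our assumptions): each local image descriptor is assigned to its nearest codebook centroid, and the resulting word IDs are posted to an inverted index, so images can be looked up much like text documents.

        import java.util.*;

        // Sketch: quantize local descriptors against a fixed codebook of
        // centroids ("visual words") and maintain an inverted index from
        // word ID to image IDs. The codebook is assumed to be trained
        // offline, e.g. by k-means over a sample of descriptors.
        class VisualWordIndex {
            private final float[][] codebook;                          // [numWords][dim]
            private final Map<Integer, Set<String>> inverted = new HashMap<>();

            VisualWordIndex(float[][] codebook) { this.codebook = codebook; }

            // Nearest-centroid assignment: the descriptor's visual word ID.
            int wordId(float[] d) {
                int best = 0; double bestDist = Double.MAX_VALUE;
                for (int w = 0; w < codebook.length; w++) {
                    double dist = 0;
                    for (int i = 0; i < d.length; i++) {
                        double diff = d[i] - codebook[w][i];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = w; }
                }
                return best;
            }

            // Index one image: every descriptor contributes one visual word.
            void add(String imageId, List<float[]> descriptors) {
                for (float[] d : descriptors)
                    inverted.computeIfAbsent(wordId(d), k -> new HashSet<>()).add(imageId);
            }

            // Query: images sharing at least one visual word with the query.
            Set<String> query(List<float[]> descriptors) {
                Set<String> hits = new HashSet<>();
                for (float[] d : descriptors)
                    hits.addAll(inverted.getOrDefault(wordId(d), Collections.emptySet()));
                return hits;
            }
        }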

    Semantic Interaction in Web-based Retrieval Systems: Adopting Semantic Web Technologies and Social Networking Paradigms for Interacting with Semi-structured Web Data

    Existing web retrieval models for exploration and interaction with web data do not take into account semantic information, nor do they allow for new forms of interaction that employ meaningful interaction and navigation metaphors in 2D/3D. This thesis researches means for introducing a semantic dimension into the search and exploration process of web content to enable a significantly more positive user experience. To this end, an inherently dynamic view beyond single concepts and models from semantic information processing, information extraction and human-machine interaction is adopted. Essential tasks for semantic interaction such as semantic annotation, semantic mediation and semantic human-computer interaction were identified and elaborated for two general application scenarios in web retrieval: web-based Question Answering in a knowledge-based dialogue system, and semantic exploration of information spaces in 2D/3D.

    Automatic construction of wrappers for semi-structured documents.

    Lin Wai-yip. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 114-123). Abstracts in English and Chinese.
    Chapter 1 --- Introduction --- p.1
    Chapter 1.1 --- Information Extraction --- p.1
    Chapter 1.2 --- IE from Semi-structured Documents --- p.3
    Chapter 1.3 --- Thesis Contributions --- p.7
    Chapter 1.4 --- Thesis Organization --- p.9
    Chapter 2 --- Related Work --- p.11
    Chapter 2.1 --- Existing Approaches --- p.11
    Chapter 2.2 --- Limitations of Existing Approaches --- p.18
    Chapter 2.3 --- Our HISER Approach --- p.20
    Chapter 3 --- System Overview --- p.23
    Chapter 3.1 --- Hierarchical record Structure and Extraction Rule learning (HISER) --- p.23
    Chapter 3.2 --- Hierarchical Record Structure --- p.29
    Chapter 3.3 --- Extraction Rule --- p.29
    Chapter 3.4 --- Wrapper Adaptation --- p.32
    Chapter 4 --- Automatic Hierarchical Record Structure Construction --- p.34
    Chapter 4.1 --- Motivation --- p.34
    Chapter 4.2 --- Hierarchical Record Structure Representation --- p.36
    Chapter 4.3 --- Constructing Hierarchical Record Structure --- p.38
    Chapter 5 --- Extraction Rule Induction --- p.43
    Chapter 5.1 --- Rule Representation --- p.43
    Chapter 5.2 --- Extraction Rule Induction Algorithm --- p.47
    Chapter 6 --- Experimental Results of Wrapper Learning --- p.54
    Chapter 6.1 --- Experimental Methodology --- p.54
    Chapter 6.2 --- Results on Electronic Appliance Catalogs --- p.56
    Chapter 6.3 --- Results on Book Catalogs --- p.60
    Chapter 6.4 --- Results on Seminar Announcements --- p.62
    Chapter 7 --- Adapting Wrappers to Unseen Information Sources --- p.69
    Chapter 7.1 --- Motivation --- p.69
    Chapter 7.2 --- Support Vector Machines --- p.72
    Chapter 7.3 --- Feature Selection --- p.76
    Chapter 7.4 --- Automatic Annotation of Training Examples --- p.80
    Chapter 7.4.1 --- Building SVM Models --- p.81
    Chapter 7.4.2 --- Seeking Potential Training Example Candidates --- p.82
    Chapter 7.4.3 --- Classifying Potential Training Examples --- p.84
    Chapter 8 --- Experimental Results of Wrapper Adaptation --- p.86
    Chapter 8.1 --- Experimental Methodology --- p.86
    Chapter 8.2 --- Results on Electronic Appliance Catalogs --- p.89
    Chapter 8.3 --- Results on Book Catalogs --- p.93
    Chapter 9 --- Conclusions and Future Work --- p.97
    Chapter 9.1 --- Conclusions --- p.97
    Chapter 9.2 --- Future Work --- p.100
    Chapter A --- Sample Experimental Pages --- p.101
    Chapter B --- Detailed Experimental Results of Wrapper Adaptation of HISER --- p.109
    Bibliography --- p.11