    Autonomous Consolidation of Heterogeneous Record-Structured HTML Data in Chameleon

    While progress has been made in querying digital information contained in XML and HTML documents, success in retrieving information from the so-called hidden Web (data behind Web forms) has been modest. There has been a nascent trend of developing autonomous tools for extracting information from the hidden Web. Automatic tools for ontology generation, wrapper generation, Web form querying, response gathering, etc., have been reported in recent research. This thesis presents a system called Chameleon for automatic querying of and response gathering from the hidden Web. The approach to response gathering is based on automatic table structure identification, since most information repositories of the hidden Web are structured databases, and so the information returned in response to a query will have regularities. Information extraction from the identified record structures is performed based on domain knowledge corresponding to the domain specified in a query. So-called domain plug-ins are used to make the dynamically generated wrappers domain-specific, rather than document-specific as in conventional approaches.
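    The response-gathering step above rests on the observation that query results from a structured back-end repeat the same markup pattern once per record. A minimal stand-in for that idea (a sketch, not the Chameleon implementation; the class and function names are invented here) tallies the tag signature of each table row and treats the dominant signature as the record structure:

```python
from html.parser import HTMLParser
from collections import Counter

class RecordFinder(HTMLParser):
    """Collect the child-tag signature of each <tr> row so repeated
    structures (candidate data records) can be detected."""
    def __init__(self):
        super().__init__()
        self.rows = []          # one tag-signature tuple per <tr>
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = []
        elif self._current is not None:
            self._current.append(tag)

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(tuple(self._current))
            self._current = None

def likely_record_signature(html):
    """Return the most common row signature; rows sharing it are
    treated as data records, the rest (headers, ads) as noise."""
    finder = RecordFinder()
    finder.feed(html)
    if not finder.rows:
        return None
    return Counter(finder.rows).most_common(1)[0][0]
```

    On a result table with one `<th>` header row and many identical `<td>` rows, the dominant signature is the `<td>` pattern, which a downstream extractor could then pair with domain knowledge.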

    WAQS : a web-based approximate query system

    The Web is often viewed as a gigantic database holding vast stores of information and provides ubiquitous accessibility to end-users. Since its inception, the Internet has experienced explosive growth both in the number of users and the amount of content available on it. However, searching for information on the Web has become increasingly difficult. Although query languages have long been part of database management systems, the standard query language, the Structured Query Language (SQL), is not suitable for Web content retrieval. In this dissertation, a new technique for document retrieval on the Web is presented. This technique is designed to allow a more detailed retrieval and hence reduce the number of matches returned by typical search engines. The main objective of this technique is to allow the query to be based on not just keywords but also the location of the keywords within the logical structure of a document. In addition, the technique also provides approximate search capabilities based on the notions of Distance and Variable Length Don't Cares. The proposed techniques have been implemented in a system, called the Web-Based Approximate Query System, which contains an SQL-like query language called the Web-Based Approximate Query Language. The Web-Based Approximate Query Language has also been integrated with EnviroDaemon, an environmental domain-specific search engine. It provides EnviroDaemon with more detailed searching capabilities than keyword-based search alone. Implementation details, technical results and future work are presented in this dissertation.
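    The combination of Distance and Variable Length Don't Cares can be illustrated with a small dynamic program (a sketch of the general technique, not the WAQS implementation; the function name is hypothetical). A `*` in the pattern absorbs any substring at no cost, while ordinary characters contribute standard edit-distance costs:

```python
def vldc_distance(pattern, text):
    """Edit distance between text and a pattern in which '*' is a
    Variable Length Don't Care matching any substring for free."""
    m, n = len(pattern), len(text)
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(m + 1)]
    dp[0][0] = 0
    for j in range(1, n + 1):
        dp[0][j] = j                      # unmatched text costs 1 per char
    for i in range(1, m + 1):
        p = pattern[i - 1]
        for j in range(n + 1):
            if p == "*":
                # '*' swallows zero or more text characters at no cost
                dp[i][j] = min(dp[i - 1][j],
                               dp[i][j - 1] if j else INF)
            else:
                best = dp[i - 1][j] + 1   # drop a pattern character
                if j:
                    best = min(best,
                               dp[i][j - 1] + 1,                    # drop a text character
                               dp[i - 1][j - 1] + (p != text[j - 1]))  # match/substitute
                dp[i][j] = best
    return dp[m][n]
```

    A query like `env*daemon` then matches `envirodaemon` exactly, while a Distance threshold lets near-misses (one typo, say) still qualify as hits.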

    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
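    The verification idea above — learn structural regularities from positive examples only, then flag extractions that stop conforming — can be sketched very roughly as follows. This is a drastically simplified stand-in, not the authors' algorithm; the profile features, the tolerance, and the 0.5 conformance threshold are arbitrary choices for illustration:

```python
import re
import statistics

def profile(examples):
    """Learn a coarse structural profile from positive examples alone:
    the set of observed start-character classes plus a typical-length band."""
    def start_class(s):
        if re.match(r"\d", s):
            return "digit"
        if re.match(r"[A-Z]", s):
            return "upper"
        return "other"
    lengths = [len(s) for s in examples]
    return {
        "starts": {start_class(s) for s in examples},
        "mean_len": statistics.mean(lengths),
        "stdev_len": statistics.pstdev(lengths),
    }

def verify(prof, extracted, tolerance=3.0):
    """Report whether freshly extracted values still resemble the profile;
    a low conformance rate suggests the source changed its format."""
    def conforms(s):
        band = tolerance * max(prof["stdev_len"], 1.0)
        if abs(len(s) - prof["mean_len"]) > band:
            return False
        first = ("digit" if s[:1].isdigit()
                 else "upper" if s[:1].isupper() else "other")
        return first in prof["starts"]
    hits = sum(conforms(s) for s in extracted)
    return hits / max(len(extracted), 1) >= 0.5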

    Automatic construction of wrappers for semi-structured documents.

    Lin Wai-yip. Thesis (M.Phil.)--Chinese University of Hong Kong, 2001. Includes bibliographical references (leaves 114-123). Abstracts in English and Chinese.
    Chapter 1 --- Introduction
    Chapter 1.1 --- Information Extraction
    Chapter 1.2 --- IE from Semi-structured Documents
    Chapter 1.3 --- Thesis Contributions
    Chapter 1.4 --- Thesis Organization
    Chapter 2 --- Related Work
    Chapter 2.1 --- Existing Approaches
    Chapter 2.2 --- Limitations of Existing Approaches
    Chapter 2.3 --- Our HISER Approach
    Chapter 3 --- System Overview
    Chapter 3.1 --- Hierarchical record Structure and Extraction Rule learning (HISER)
    Chapter 3.2 --- Hierarchical Record Structure
    Chapter 3.3 --- Extraction Rule
    Chapter 3.4 --- Wrapper Adaptation
    Chapter 4 --- Automatic Hierarchical Record Structure Construction
    Chapter 4.1 --- Motivation
    Chapter 4.2 --- Hierarchical Record Structure Representation
    Chapter 4.3 --- Constructing Hierarchical Record Structure
    Chapter 5 --- Extraction Rule Induction
    Chapter 5.1 --- Rule Representation
    Chapter 5.2 --- Extraction Rule Induction Algorithm
    Chapter 6 --- Experimental Results of Wrapper Learning
    Chapter 6.1 --- Experimental Methodology
    Chapter 6.2 --- Results on Electronic Appliance Catalogs
    Chapter 6.3 --- Results on Book Catalogs
    Chapter 6.4 --- Results on Seminar Announcements
    Chapter 7 --- Adapting Wrappers to Unseen Information Sources
    Chapter 7.1 --- Motivation
    Chapter 7.2 --- Support Vector Machines
    Chapter 7.3 --- Feature Selection
    Chapter 7.4 --- Automatic Annotation of Training Examples
    Chapter 7.4.1 --- Building SVM Models
    Chapter 7.4.2 --- Seeking Potential Training Example Candidates
    Chapter 7.4.3 --- Classifying Potential Training Examples
    Chapter 8 --- Experimental Results of Wrapper Adaptation
    Chapter 8.1 --- Experimental Methodology
    Chapter 8.2 --- Results on Electronic Appliance Catalogs
    Chapter 8.3 --- Results on Book Catalogs
    Chapter 9 --- Conclusions and Future Work
    Chapter 9.1 --- Conclusions
    Chapter 9.2 --- Future Work
    Chapter A --- Sample Experimental Pages
    Chapter B --- Detailed Experimental Results of Wrapper Adaptation of HISER
    Bibliography

    Doctor of Philosophy

    Medical knowledge learned in medical school can become quickly outdated given the tremendous growth of the biomedical literature. It is the responsibility of medical practitioners to continuously update their knowledge with recent, best available clinical evidence to make informed decisions about patient care. However, clinicians often have little time to spend on reading the primary literature, even within their narrow specialty. As a result, they often rely on systematic evidence reviews developed by medical experts to fulfill their information needs. At present, systematic reviews of clinical research are manually created and updated, which is expensive, slow, and unable to keep up with the rapidly growing pace of medical literature. This dissertation research aims to enhance the traditional systematic review development process using computer-aided solutions. The first study investigates query expansion and scientific quality ranking approaches to enhance literature search on clinical guideline topics. The study showed that unsupervised methods can improve the retrieval performance of a popular biomedical search engine (PubMed). The proposed methods improve the comprehensiveness of literature search and increase the ratio of finding relevant studies with reduced screening effort. The second and third studies aim to enhance the traditional manual data extraction process. The second study developed a framework to extract and classify texts from PDF reports. This study demonstrated that a rule-based multipass sieve approach is more effective than a machine-learning approach in categorizing document-level structures, and that classifying and filtering publication metadata and semistructured texts enhances the performance of an information extraction system. The proposed method could serve as a document processing step in any text mining research on PDF documents.
    The third study proposed a solution for computer-aided data extraction by recommending relevant sentences and key phrases extracted from publication reports. This study demonstrated that using a machine-learning classifier to prioritize sentences for specific data elements performs equally well as or better than an abstract screening approach, and might save time and reduce errors in the full-text screening process. In summary, this dissertation showed that there are promising opportunities for technology enhancement to assist in the development of systematic reviews. In this modern age, when computing resources are getting cheaper and more powerful, the failure to apply computer technologies to assist and optimize the manual processes is a lost opportunity to improve the timeliness of systematic reviews. This research provides methodologies and tests hypotheses which can serve as the basis for further large-scale software engineering projects aimed at fully realizing the prospect of computer-aided systematic reviews.
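    The sentence-recommendation step of the third study can be caricatured with a tiny cue-term ranker. This is a stdlib-only stand-in for the machine-learning classifier the dissertation describes; the cue terms, their weights, and the function name are all invented for illustration:

```python
import re
from collections import Counter

# Hypothetical cue terms for a "sample size" data element, with
# hand-picked weights; a real system would learn these from labels.
CUE_TERMS = {
    "sample": 2.0, "participants": 2.0, "patients": 2.0,
    "randomized": 1.5, "enrolled": 1.5,
}

def rank_sentences(report_text, cues=CUE_TERMS, top_k=3):
    """Score each sentence by weighted cue-term counts and return the
    top_k most promising sentences for a reviewer's attention."""
    sentences = re.split(r"(?<=[.!?])\s+", report_text.strip())
    def score(sentence):
        tokens = Counter(re.findall(r"[a-z]+", sentence.lower()))
        return sum(weight * tokens[term] for term, weight in cues.items())
    return sorted(sentences, key=score, reverse=True)[:top_k]
```

    The point is the workflow, not the scoring: a reviewer reads the few recommended sentences per data element instead of the full text, which is where the reported time savings come from.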

    Automatically Learning User Needs from Online Reviews for New Product Design

    The traditional product design process begins with the identification of user needs (Ulrich and Eppinger 2008). Traditional methods for needs identification include focus groups, surveys, interviews, and anthropological studies. In this paper, we propose to augment traditional methods for identifying user needs by automatically analyzing user-generated online product reviews. Specifically, we present a supervised, machine-learning approach for sentence-level adaptive text extraction and mining. Based upon a set of more than 9,700 digital camera product reviews gathered in January 2008, we evaluate the approach in three ways. First, we report precision and recall using n-fold cross-validation on labeled data. Second, we compare the recall of automated learning with respect to traditional measures for identifying users and their respective needs. Third, we use multidimensional scaling (MDS) to visualize the competitive landscape by mapping existing products in terms of the user needs that they address.
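    The n-fold cross-validation used in the first evaluation can be sketched generically as follows (an illustration of the standard protocol, not the authors' code; the function name and seed are arbitrary):

```python
import random

def n_fold_splits(items, n_folds=10, seed=0):
    """Yield (train, test) partitions for n-fold cross-validation:
    every item lands in exactly one test fold, and trains the rest."""
    items = list(items)
    rng = random.Random(seed)      # fixed seed keeps folds reproducible
    rng.shuffle(items)
    folds = [items[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

    Averaging precision and recall over the n held-out folds gives the performance estimate the paper reports on its labeled review sentences.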