    The Web as a Resource for Question Answering: Perspectives and Challenges

    The vast amounts of information readily available on the World Wide Web can be effectively used for question answering in two fundamentally different ways. In the federated approach, techniques for handling semistructured data are applied to access Web sources as if they were databases, allowing large classes of common questions to be answered uniformly. In the distributed approach, large-scale text-processing techniques are used to extract answers directly from unstructured Web documents. Because the Web is orders of magnitude larger than any human-collected corpus, question answering systems can capitalize on its unparalleled levels of data redundancy. Analysis of real-world user questions reveals that the federated and distributed approaches complement each other nicely, suggesting a hybrid approach in future question answering systems.

    Finding structure and characteristic of web documents for classification.

    by Wong, Wai Ching. Thesis (M.Phil.)--Chinese University of Hong Kong, 2000. Includes bibliographical references (leaves 91-94). Abstracts in English and Chinese.
    Abstract
    Acknowledgments
    Chapter 1 --- Introduction
    Chapter 1.1 --- Semistructured Data
    Chapter 1.2 --- Problem Addressed in the Thesis
    Chapter 1.2.1 --- Labels and Values
    Chapter 1.2.2 --- Discover Labels for the Same Attribute
    Chapter 1.2.3 --- Classifying A Web Page
    Chapter 1.3 --- Organization of the Thesis
    Chapter 2 --- Background
    Chapter 2.1 --- Related Work on Web Data
    Chapter 2.1.1 --- Object Exchange Model (OEM)
    Chapter 2.1.2 --- Schema Extraction
    Chapter 2.1.3 --- Discovering Typical Structure
    Chapter 2.1.4 --- Information Extraction of Web Data
    Chapter 2.2 --- Automatic Text Processing
    Chapter 2.2.1 --- Stopwords Elimination
    Chapter 2.2.2 --- Stemming
    Chapter 3 --- Web Data Definition
    Chapter 3.1 --- Web Page
    Chapter 3.2 --- Problem Description
    Chapter 4 --- Hierarchical Structure
    Chapter 4.1 --- Types of HTML Tags
    Chapter 4.2 --- Tag-tree
    Chapter 4.3 --- Hierarchical Structure Construction
    Chapter 4.4 --- Hierarchical Structure Statistics
    Chapter 5 --- Similar Labels Discovery
    Chapter 5.1 --- Expression of Hierarchical Structure
    Chapter 5.2 --- Labels Discovery Algorithm
    Chapter 5.2.1 --- Phase 1: Remove Non-label Nodes
    Chapter 5.2.2 --- Phase 2: Identify Label Nodes
    Chapter 5.2.3 --- Phase 3: Discover Similar Labels
    Chapter 5.3 --- Performance Evaluation of Labels Discovery Algorithm
    Chapter 5.3.1 --- Phase 1 Results
    Chapter 5.3.2 --- Phase 2 Results
    Chapter 5.3.3 --- Phase 3 Results
    Chapter 5.4 --- Classifying a Web Page
    Chapter 5.4.1 --- Similarity Measurement
    Chapter 5.4.2 --- Performance Evaluation
    Chapter 6 --- Conclusion

    Modeling tools for the integration of structured data sources

    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, February 2011. "December 2010." Cataloged from PDF version of thesis. Includes bibliographical references (p. 61-64). Disparity in representations within structured documents such as XML or SQL makes interoperability challenging, error-prone and expensive. A model is developed to process disparate representations into an encompassing generic knowledge representation. Data sources were characterized according to a number of smaller models: their case; the underlying data storage structures; a content model based on the ontological structure defined by the document's schema; and the data model or physical structure of the schema. In order to harmonize different representations and give them semantic meaning, the representation is mapped from the above categories to a common dictionary. The models were implemented as a structured data analysis tool, and a basis was built to compare across schemas and documents. Data exchange within modeling and simulation environments is increasingly in the form of XML using a variety of schemas. Therefore, we demonstrate the use of this modeling tool to automatically harmonize multiple disparate XML data sources in a prototype simulated environment. by Jyotsna Venkataramanan. M.Eng.

    A history and theory of textual event detection and recognition

    e-DOCSPROS : exploring TEXPROS into e-business era

    Document processing is a critical element of office automation. TEXPROS (TEXt PROcessing System) is a knowledge-based system designed to manage personal documents. However, as the Internet and e-Business have changed the way offices operate, there is a need to re-envision document processing, storage, retrieval, and sharing. In the current environment, people must be able to access documents remotely and to share those documents with others. e-DOCPROS (e-DOCument PROcessing System) is a new document processing system that takes advantage of many of TEXPROS's structures but adapts the system to this new environment. The new system is built to serve e-businesses, to take advantage of Internet protocols, and to provide remote access and document sharing. e-DOCPROS meets the challenge of providing wider usage and will eventually improve the efficiency and effectiveness of office automation. It allows end users to access their data through any Web browser with Internet access, even over a wireless network, which will evolutionarily change the way we manage information. The application of e-DOCPROS to e-Business is considered. Four types of business models are considered here. The first is the Business-to-Business (B2B) model, which performs business-to-business transactions through an Extranet. The Extranet consists of multiple Intranets connected via the Internet. The second is the Business-to-Consumer (B2C) model, which performs business-to-consumer transactions through the Internet. The third is the Intranet model, which performs transactions within an organization through the organization's network. The fourth is the Consumer-to-Consumer (C2C) model, which performs consumer-to-consumer transactions through the Internet. A triple model is proposed in this dissertation to integrate the organization type hierarchy and the document type hierarchy into folder organization.
    e-DOCPROS introduces new features into TEXPROS to support those four business models and to accommodate the system requirements. Extensible Markup Language (XML), an industrial standard protocol for data exchange, is employed to achieve information exchange between e-DOCPROS and other systems, and also among the subsystems within e-DOCPROS. The Document Object Model (DOM) specification is followed throughout the implementation of e-DOCPROS to achieve portability. An agent-based Application Service Provider (ASP) implementation is employed in the e-DOCPROS system to achieve cost-effectiveness and accessibility.