66,682 research outputs found

    Wrapper Maintenance: A Machine Learning Approach

    Full text link
    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task

    Mining Measured Information from Text

    Full text link
    We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2 ). Such extractions are critically important across a wide range of domains - especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 {\mu}m"). Finally, we present MQSearch: the realization of a search engine with full support for measured information.Comment: 4 pages; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15

    Rough sets theory for travel demand analysis in Malaysia

    Get PDF
    This study integrates the rough sets theory into tourism demand analysis. Originated from the area of Artificial Intelligence, the rough sets theory was introduced to disclose important structures and to classify objects. The Rough Sets methodology provides definitions and methods for finding which attributes separates one class or classification from another. Based on this theory can propose a formal framework for the automated transformation of data into knowledge. This makes the rough sets approach a useful classification and pattern recognition technique. This study introduces a new rough sets approach for deriving rules from information table of tourist in Malaysia. The induced rules were able to forecast change in demand with certain accuracy

    Warranty Data Analysis: A Review

    Get PDF
    Warranty claims and supplementary data contain useful information about product quality and reliability. Analysing such data can therefore be of benefit to manufacturers in identifying early warnings of abnormalities in their products, providing useful information about failure modes to aid design modification, estimating product reliability for deciding on warranty policy and forecasting future warranty claims needed for preparing fiscal plans. In the last two decades, considerable research has been conducted in warranty data analysis (WDA) from several different perspectives. This article attempts to summarise and review the research and developments in WDA with emphasis on models, methods and applications. It concludes with a brief discussion on current practices and possible future trends in WDA

    Identification of cellular automata based on incomplete observations with bounded time gaps

    Get PDF
    In this paper, the problem of identifying the cellular automata (CAs) is considered. We frame and solve this problem in the context of incomplete observations, i.e., prerecorded, incomplete configurations of the system at certain, and unknown time stamps. We consider 1-D, deterministic, two-state CAs only. An identification method based on a genetic algorithm with individuals of variable length is proposed. The experimental results show that the proposed method is highly effective. In addition, connections between the dynamical properties of CAs (Lyapunov exponents and behavioral classes) and the performance of the identification algorithm are established and analyzed
    corecore