Search CORE


Wrapper Maintenance: A Machine Learning Approach

Author: Knoblock C. A.
Lerman K.
Minton S. N.
Publication venue: 'AI Access Foundation'
Publication date: 23/06/2011

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in a precision of 0.73 and a recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
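The verification figures quoted in the abstract can be checked with a short calculation. The sketch below is illustrative only: the function names are ours, and the abstract does not say how the 16 mistakes divide into false alarms and misses, so the false-positive count used for the precision example is a hypothetical value, not a figure from the paper.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real changes that were detected."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of raised alarms that were real changes."""
    return true_pos / (true_pos + false_pos)


# From the abstract: 35 of the 37 wrapper changes were discovered, so 2 were missed.
verification_recall = round(recall(true_pos=35, false_neg=2), 2)

# ASSUMPTION: 13 false alarms is a hypothetical count chosen for illustration;
# the abstract reports only the total number of mistakes.
illustrative_precision = round(precision(true_pos=35, false_pos=13), 2)

print(verification_recall, illustrative_precision)
```

The recall follows directly from the stated counts; the precision line only shows how such a value would be computed once the number of false alarms is known.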

arXiv.org e-Print Archive

Crossref

Mining Measured Information from Text

Author: Maiya Arun S.
Visser Dale
Wan Andrew
Publication date: 05/05/2015

We present an approach to extract measured information from text (e.g., a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2). Such extractions are critically important across a wide range of domains, especially those involving search and exploration of scientific and technical documents. We first propose a rule-based entity extractor to mine measured quantities (i.e., a numeric value paired with a measurement unit), which supports a vast and comprehensive set of both common and obscure measurement units. Our method is highly robust and can correctly recover valid measured quantities even when significant errors are introduced through the process of converting document formats like PDF to plain text. Next, we describe an approach to extracting the properties being measured (e.g., the property "pixel pitch" in the phrase "a pixel pitch as high as 352 μm"). Finally, we present MQSearch: the realization of a search engine with full support for measured information. Comment: 4 pages; 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '15)
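To make the idea of a measured quantity (a numeric value paired with a unit) concrete, here is a minimal regular-expression sketch. It is not the extractor described in the abstract: the unit list, the names `UNITS`, `QUANTITY_RE` and `extract_quantities`, and the matching rules are all assumptions made for illustration, and a real system would need a far larger unit inventory and tolerance for conversion noise.

```python
import re

# ASSUMPTION: a tiny, hand-picked unit list for illustration only.
UNITS = ["kg/m^2", "degrees C", "μm", "mm", "kg", "m"]

# Try longer units first so "kg/m^2" is preferred over "kg".
_unit_alternatives = "|".join(
    re.escape(unit) for unit in sorted(UNITS, key=len, reverse=True)
)

# A number, optional whitespace, then a known unit not followed by a letter.
QUANTITY_RE = re.compile(
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>" + _unit_alternatives + r")(?![A-Za-z])"
)


def extract_quantities(text: str) -> list[tuple[float, str]]:
    """Return (value, unit) pairs found in the text, in order of appearance."""
    return [
        (float(match.group("value")), match.group("unit"))
        for match in QUANTITY_RE.finditer(text)
    ]


sample = "a 1370 degrees C melting point, a BMI greater than 29.9 kg/m^2"
print(extract_quantities(sample))
```

Sorting the units by length before building the alternation is the one design choice worth noting: regex alternation takes the first branch that matches, so shorter units must come last.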

arXiv.org e-Print Archive

Crossref

Rough sets theory for travel demand analysis in Malaysia

Author: Wong Jenn Hwee
Publication date: 01/10/2008

This study integrates rough sets theory into tourism demand analysis. Originating in the area of Artificial Intelligence, rough sets theory was introduced to disclose important structures and to classify objects. The rough sets methodology provides definitions and methods for finding which attributes separate one class or classification from another. Based on this theory, one can propose a formal framework for the automated transformation of data into knowledge. This makes the rough sets approach a useful classification and pattern recognition technique. This study introduces a new rough sets approach for deriving rules from an information table of tourists in Malaysia. The induced rules were able to forecast changes in demand with a certain accuracy.
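The core construction in rough sets theory is the pair of lower and upper approximations of a target class, computed from groups of objects that are indistinguishable on the chosen attributes. The sketch below uses the standard textbook definitions; the table contents, attribute names, and the function name `approximations` are invented for illustration and do not come from the study.

```python
from collections import defaultdict


def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target` under `attributes`.

    table: dict mapping object id -> dict of attribute values.
    target: set of object ids belonging to the class of interest.
    """
    # Group objects that share the same values on the chosen attributes.
    classes = defaultdict(set)
    for obj, row in table.items():
        classes[tuple(row[a] for a in attributes)].add(obj)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:   # block certainly belongs to the class
            lower |= block
        if block & target:    # block possibly belongs to the class
            upper |= block
    return lower, upper


# HYPOTHETICAL toy information table; values are not from the study.
table = {
    "t1": {"season": "peak", "price": "low"},
    "t2": {"season": "peak", "price": "low"},
    "t3": {"season": "off", "price": "low"},
    "t4": {"season": "off", "price": "high"},
}
demand_up = {"t1", "t3"}

lower, upper = approximations(table, ["season", "price"], demand_up)
print(sorted(lower), sorted(upper))
```

Objects in the lower approximation support certain rules, while those only in the upper approximation support possible rules, which is why the two sets differ whenever indistinguishable objects fall into different classes.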

Universiti Teknologi Malaysia Institutional Repository

Warranty Data Analysis: A Review

Author: Ahn
Alam
Attardi
Baik
Blischke
Brennan
Buddhakulsomsiri
Chen
Chukova
Davis
Djamaludin
Duchesne
Elkins
Escobar
Fredette
Gertsbakh
Grabert
Honari
Hrycej
Hu
Ion
Iskandar
Jung
Kalbfleisch
Kaminskiy
Karim
Kijima
Kleyner
Krivtsov
Lawless
Majeske
Marcorin
Marshall
Meeker
Moskowitz
Murthy
Oh
Pal
Phillips
Rahman
Rai
Robinson
Sahin
Singpurwalla
Sureka
Suzuki
Thomas
Vinta
Vintr
Vittal
Wang
Wasserman
Wilson
Wu
Yang
Zuo
Publication venue: 'Wiley'
Publication date: 10/01/2012

Warranty claims and supplementary data contain useful information about product quality and reliability. Analysing such data can therefore benefit manufacturers by identifying early warnings of abnormalities in their products, providing useful information about failure modes to aid design modification, estimating product reliability for deciding on warranty policy, and forecasting future warranty claims needed for preparing fiscal plans. In the last two decades, considerable research has been conducted in warranty data analysis (WDA) from several different perspectives. This article attempts to summarise and review the research and developments in WDA with emphasis on models, methods and applications. It concludes with a brief discussion on current practices and possible future trends in WDA.

Crossref

Kent Academic Repository

Identification of cellular automata based on incomplete observations with bounded time gaps

Author: Baetens Jan
Bolt Witold Tadeusz
De Baets Bernard
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2020

In this paper, the problem of identifying cellular automata (CAs) is considered. We frame and solve this problem in the context of incomplete observations, i.e., prerecorded, incomplete configurations of the system at certain, unknown time stamps. We consider 1-D, deterministic, two-state CAs only. An identification method based on a genetic algorithm with individuals of variable length is proposed. The experimental results show that the proposed method is highly effective. In addition, connections between the dynamical properties of CAs (Lyapunov exponents and behavioral classes) and the performance of the identification algorithm are established and analyzed.
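For readers unfamiliar with the objects being identified, the following sketch shows one update step of a 1-D, deterministic, two-state CA. It is only the forward model, not the genetic-algorithm identification method from the paper. The neighbourhood radius of 1, the Wolfram rule numbering, the periodic boundary, and the function names `step` and `evolve` are all assumptions made for this illustration.

```python
def step(cells: list[int], rule: int) -> list[int]:
    """Apply one synchronous update of a radius-1, two-state CA.

    ASSUMPTIONS: `rule` is a Wolfram rule number (0-255) and the lattice
    wraps around (periodic boundary).
    """
    n = len(cells)
    new_cells = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        # The neighbourhood read as a 3-bit number selects one bit of the rule.
        index = (left << 2) | (center << 1) | right
        new_cells.append((rule >> index) & 1)
    return new_cells


def evolve(cells: list[int], rule: int, steps: int) -> list[int]:
    """Apply `steps` consecutive updates and return the final configuration."""
    for _ in range(steps):
        cells = step(cells, rule)
    return cells


# Rule 90 sets each cell to the XOR of its two neighbours.
print(step([0, 0, 1, 0, 0], rule=90))
```

In an identification setting, a candidate rule could be scored by evolving an observed configuration forward and comparing the result with later observations; when the time gap between observations is unknown, several values of `steps` would have to be tried.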

Ghent University Academic Bibliography