Search CORE

2 research outputs found

Mining Multiple Web Sources Using Non-Deterministic Finite State Automata

Author: Harun-Or-Rashid Mohammad
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2012
Field of study

Existing web content extracting systems use unsupervised, supervised, and semi-supervised approaches. The WebOMiner system is an automatic web content data extraction system which models a specific Business to Customer (B2C) web site such as bestbuy.com using object oriented database schema. WebOMiner system extracts different web page content types like product, list, text using non deterministic finite automaton (NFA) generated manually. This thesis extends the automatic web content data extraction techniques proposed in the WebOMiner system to handle multiple web sites and generate integrated data warehouse automatically. We develop the WebOMiner-2 which generates NFA of specific domain classes from regular expressions extracted from web page DOM trees\u27 frequent patterns. Our algorithm can also handle NFA epsilon([varepsilon]) transition and convert it to deterministic finite automata (DFA) to identify different content tuples from list of tuples. Experimental results show that our system is highly effective and performs the content extraction task with 100% precision and 98.35% recall value

CiteSeerX

Scholarship at UWindsor

Optimization of materialization strategies for derived data elements

Author
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date
Field of study

Crossref