2 research outputs found

Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Authors: Hirokawa Sachio (廣川佐千男), Ikeda Daisuke (池田大輔), Yamada Yasuhiro (山田泰寛)
Publication venue: Springer
Publication date: 01/11/2001

Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001. Proceedings

We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.
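
The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact definitions or the rule for choosing the length and frequency pair, so the function names, the use of character n-grams, and the `min_count` threshold below are all assumptions made for illustration. The sketch marks every character covered by a frequent n-gram as useless, then counts how often a document switches between useless and non-useless positions.

```python
from collections import Counter


def ngram_counts(docs, n):
    """Count every character n-gram across all documents (assumed unit)."""
    counts = Counter()
    for doc in docs:
        for i in range(len(doc) - n + 1):
            counts[doc[i:i + n]] += 1
    return counts


def useless_mask(doc, counts, n, min_count):
    """Mark each character covered by an n-gram seen at least min_count times.

    min_count is a hypothetical frequency threshold, not a value from the paper.
    """
    mask = [False] * len(doc)
    for i in range(len(doc) - n + 1):
        if counts[doc[i:i + n]] >= min_count:
            for j in range(i, i + n):
                mask[j] = True
    return mask


def alternation_count(mask):
    """Number of switches between useless and non-useless positions."""
    return sum(1 for a, b in zip(mask, mask[1:]) if a != b)


def keep_non_useless(doc, mask):
    """Return the characters that were not marked useless."""
    return "".join(ch for ch, useless in zip(doc, mask) if not useless)


if __name__ == "__main__":
    # Toy pages that share the same markup but differ in content.
    docs = ["<h1>apple pie</h1>", "<h1>green tea</h1>", "<h1>rye bread</h1>"]
    n, min_count = 4, 3
    counts = ngram_counts(docs, n)
    for doc in docs:
        mask = useless_mask(doc, counts, n, min_count)
        print(keep_non_useless(doc, mask), alternation_count(mask))
```

On these toy pages the shared markup is the only text repeated in every document, so it is the part marked useless. How the alternation count is then used to pick a good length and frequency pair is described in the paper itself and is not reproduced here.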

Kyushu University Institutional Repository

Eliminating Useless Parts in Semi-structured Documents Using Alternation Counts

Authors: Hirokawa Sachio (廣川佐千男, ヒロカワサチオ), Ikeda Daisuke (池田大輔, イケダダイスケ), Yamada Yasuhiro (山田泰寛, ヤマダヤスヒロ)
Publication venue: Springer Fachmedien Wiesbaden GmbH

We propose a preprocessing method for Web mining which, given semi-structured documents with the same structure and style, distinguishes useless parts from non-useless parts in each document without any prior knowledge of the documents. It is based on the simple idea that any n-gram is useless if it appears frequently. To decide an appropriate pair of length and frequency, we introduce a new statistical measure, the alternation count: the number of alternations between useless parts and non-useless parts. Given news articles written in English or Japanese, mixed with some non-articles, the algorithm eliminates the frequent n-grams used for the structure and style of the articles and extracts the news contents and headlines with more than 97% accuracy if the articles are collected from the same site. Even if the input articles are collected from different sites, the algorithm extracts the contents of articles from these sites with at least 95% accuracy. Thus, the algorithm does not depend on the language, is robust to noise, and is applicable to multiple formats.

Discovery Science: 4th International Conference, DS 2001, Washington, DC, USA, November 25-28, 2001. Proceeding

Institutional Repositories DataBase (IRDB)