A model for fast web mining prototyping
Web mining is a computation-intensive task, even after the mining tool itself has been developed. Most mining software is developed ad hoc and is usually neither scalable nor reused for other mining tasks. The objective of this paper is to present a model for fast Web mining prototyping, referred to as WIM (Web Information Mining). The underlying conceptual model of WIM provides its users with a level of abstraction appropriate for prototyping and experimentation throughout the Web data mining task. Abstracting from the idiosyncrasies of raw Web data representations facilitates the inherently iterative mining process. We present the WIM conceptual model, its associated algebra, and the WIM tool software architecture, which implements the WIM model. We also illustrate how the model can be applied to real Web data mining tasks. Experimentation with WIM in real use cases has shown that it significantly facilitates Web mining prototyping. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - data mining; D.2.m [Software Engineering]: Miscellaneous - rapi
s-WIM: a scalable web information mining tool.
Graduate Program in Computer Science (Programa de Pós-Graduação em Ciência da Computação), Department of Computer Science, Institute of Exact and Biological Sciences, Universidade Federal de Ouro Preto.
Web mining can be seen as the process of discovering patterns from the Web by means
of data mining techniques. Web mining is a computation-intensive task and most mining
software is developed ad-hoc, which makes scalability and reusability difficult for other
mining tasks. Web mining is an iterative process and prototyping plays an essential role in
experimenting with different alternatives, as well as in incorporating knowledge acquired
in previous iterations of the process.
Web Information Mining (WIM) is a model for fast Web mining prototyping. The main
motivation behind WIM development was the fact that its conceptual model provides its
users with a high level of abstraction, appropriate for prototyping and experimenting during
the mining tasks.
WIM is composed of a data model and an algebra. The WIM data model is a relational view of Web data. The three types of existing Web data, namely Web content, Web
structure and Web usage, are represented by relations. The main input components for the
WIM data model are the Web pages, the hyperlink structure linking Web pages and the
query logs obtained from Web users' navigation. WIM materializes a declarative programming language from its algebra. The WIM programming language is based on dataflows,
where sequences of operations are applied to relations. The operations are defined by the
WIM algebra, which contains operators for data manipulation and for data mining.
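
The idea of a dataflow over relations can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the WIM language or its operator set: the relation layouts (`pages`, `links`) and the operator names `select`, `join` and `in_degree` are hypothetical, chosen to show one data-manipulation step feeding one mining-style step.

```python
from collections import Counter

# Hypothetical relations (lists of dict rows); the schemas are assumptions.
pages = [
    {"url": "a.org", "text": "web mining model"},
    {"url": "b.org", "text": "scalable mining tool"},
    {"url": "c.org", "text": "query logs"},
]
links = [
    {"src": "a.org", "dst": "b.org"},
    {"src": "c.org", "dst": "b.org"},
    {"src": "b.org", "dst": "a.org"},
]


def select(relation, predicate):
    """Data-manipulation operator: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def join(left, right, left_key, right_key):
    """Data-manipulation operator: equi-join two relations on the given keys."""
    index = {}
    for r in right:
        index.setdefault(r[right_key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[left_key], [])]


def in_degree(link_relation):
    """Toy mining-style operator: count incoming links per destination URL."""
    counts = Counter(row["dst"] for row in link_relation)
    return [{"url": url, "in_degree": n} for url, n in counts.items()]


# Dataflow: each step consumes relations and produces a new relation.
mining_pages = select(pages, lambda r: "mining" in r["text"])
ranked = join(mining_pages, in_degree(links), "url", "url")
```

Because every operator takes relations in and returns a relation, steps can be recombined or swapped while prototyping, which is the property the abstract attributes to the dataflow style.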
The objective of this work is the software design and development of the Scalable Web
Information Mining (s-WIM), given the data model and the algebra originally presented by
WIM. To give the s-WIM operators the intended scalability, and consequently the
programs generated by them, the s-WIM operators were developed on
top of Apache Hadoop and Apache HBase, which provide linear scalability for both data storage
and processing through the addition of hardware resources.
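
Hadoop's processing model is MapReduce, so an operator built on it has to be expressed as a map phase and a reduce phase over partitioned data. The sketch below imitates that shape in plain, single-process Python for a link-counting operator. It is an assumption-laden illustration, not s-WIM code and not the Hadoop API: `run_job`, `map_link` and `reduce_counts` are hypothetical names, and the partitions stand in for data blocks spread across machines.

```python
from collections import defaultdict
from itertools import chain


def map_link(row):
    """Map phase: emit (destination URL, 1) for each hyperlink row."""
    yield row["dst"], 1


def reduce_counts(key, values):
    """Reduce phase: sum the partial counts collected for one URL."""
    return key, sum(values)


def run_job(partitions, mapper, reducer):
    """Single-process stand-in for a MapReduce job (hypothetical helper)."""
    # Map: every row of every partition is processed independently.
    mapped = chain.from_iterable(
        mapper(row) for part in partitions for row in part
    )
    # Shuffle: group the emitted values by key.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: each key is reduced independently of the others.
    return dict(reducer(key, values) for key, values in groups.items())


# Two partitions standing in for data stored on different nodes.
partitions = [
    [{"src": "a.org", "dst": "b.org"}, {"src": "c.org", "dst": "b.org"}],
    [{"src": "b.org", "dst": "a.org"}],
]
result = run_job(partitions, map_link, reduce_counts)
```

Since the map calls and the per-key reduce calls do not depend on each other, a real cluster can spread them over additional machines, which is the mechanism behind scaling by adding hardware.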
The main motivation for s-WIM development is the lack of a free platform offering both
the high level of abstraction provided by the WIM algebra and the scalability necessary to operate on huge data volumes. Furthermore, the high level of abstraction
provided by the WIM algebra allows users without expertise in programming languages
such as Java or C++ to effectively use s-WIM.
The design and the architecture of s-WIM on top of Hadoop and HBase are presented in
this work, as well as details on the implementation of the most complex s-WIM operators.
This work also presents several experiments performed on s-WIM and their results, which
confirm the scalability of s-WIM and, consequently, its support for the mining of huge data
volumes, including Web data sets.