25,485 research outputs found
Alternative approach to tree-structured web log representation and mining
More recent approaches to web log data representation aim to capture the user navigational patterns with respect to the overall structure of the web site. One such representation is tree-structured log files which is the focus of this work. Most existing methods for analyzing such data are based on the use of frequent subtree mining techniques to extract frequent user activity and navigational paths. In this paper we evaluate the use of other standard data mining techniques enabled by a recently proposed structure preserving flat data representation for tree-structured data. The initially proposed framework was adjusted to better suit the web log mining task. Experimental evaluation is performed on two real world web log datasets and comparisons are made with an existing state-of-the art classifier for tree-structured data. The results show the great potential of the method in enabling the application of a wider range of data mining/analysis techniques to tree-structured web log data
BlogForever D2.6: Data Extraction Methodology
This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform
Distributed-based massive processing of activity logs for efficient user modeling in a Virtual Campus
This paper reports on a multi-fold approach for the building of user models based on the identification of navigation patterns in a virtual campus, allowing for adapting the campusâ usability to the actual learnersâ needs, thus resulting in a great stimulation of the learning experience. However, user modeling in this context implies a constant processing and analysis of user interaction data during long-term learning activities, which produces huge amounts of valuable data stored typically in server log files. Due to the large or very large size of log files generated daily, the massive processing is a foremost step in extracting useful information. To this end, this work studies, first, the viability of processing large log data files of a real Virtual Campus using different distributed infrastructures. More precisely, we study the time performance of massive processing of daily log files implemented following the master-slave paradigm and evaluated using Cluster Computing and PlanetLab platforms. The study reveals the complexity and challenges of massive processing in the big data era, such as the need to carefully tune the log file processing in terms of chunk log data size to be processed at slave nodes as well as the bottleneck in processing in truly geographically distributed infrastructures due to the overhead caused by the communication time among the master and slave nodes. Then, an application of the massive processing approach resulting in log data processed and stored in a well-structured format is presented. We show how to extract knowledge from the log data analysis by using the WEKA framework for data mining purposes showing its usefulness to effectively build user models in terms of identifying interesting navigation patters of on-line learners. The study is motivated and conducted in the context of the actual data logs of the Virtual Campus of the Open University of Catalonia.Peer ReviewedPostprint (author's final draft
Argument Mining with Structured SVMs and RNNs
We propose a novel factor graph model for argument mining, designed for
settings in which the argumentative relations in a document do not necessarily
form a tree structure. (This is the case in over 20% of the web comments
dataset we release.) Our model jointly learns elementary unit type
classification and argumentative relation prediction. Moreover, our model
supports SVM and RNN parametrizations, can enforce structure constraints (e.g.,
transitivity), and can express dependencies between adjacent relations and
propositions. Our approaches outperform unstructured baselines in both web
comments and argumentative essay datasets.Comment: Accepted for publication at ACL 2017. 11 pages, 5 figures. Code at
https://github.com/vene/marseille and data at http://joonsuk.org
Artifact Lifecycle Discovery
Artifact-centric modeling is a promising approach for modeling business
processes based on the so-called business artifacts - key entities driving the
company's operations and whose lifecycles define the overall business process.
While artifact-centric modeling shows significant advantages, the overwhelming
majority of existing process mining methods cannot be applied (directly) as
they are tailored to discover monolithic process models. This paper addresses
the problem by proposing a chain of methods that can be applied to discover
artifact lifecycle models in Guard-Stage-Milestone notation. We decompose the
problem in such a way that a wide range of existing (non-artifact-centric)
process discovery and analysis methods can be reused in a flexible manner. The
methods presented in this paper are implemented as software plug-ins for ProM,
a generic open-source framework and architecture for implementing process
mining tools
Normalized Web Distance and Word Similarity
There is a great deal of work in cognitive psychology, linguistics, and
computer science, about using word (or phrase) frequencies in context in text
corpora to develop measures for word similarity or word association, going back
to at least the 1960s. The goal of this chapter is to introduce the
normalizedis a general way to tap the amorphous low-grade knowledge available
for free on the Internet, typed in by local users aiming at personal
gratification of diverse objectives, and yet globally achieving what is
effectively the largest semantic electronic database in the world. Moreover,
this database is available for all by using any search engine that can return
aggregate page-count estimates for a large range of search-queries. In the
paper introducing the NWD it was called `normalized Google distance (NGD),' but
since Google doesn't allow computer searches anymore, we opt for the more
neutral and descriptive NWD. web distance (NWD) method to determine similarity
between words and phrases. ItComment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural
Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau
Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN
978-142008592
- âŠ