Distributed-based massive processing of activity logs for efficient user modeling in a Virtual Campus
This paper reports on a multi-fold approach to building user models based on the identification of navigation patterns in a virtual campus, which allows the campus’ usability to be adapted to learners’ actual needs and thus stimulates the learning experience. However, user modeling in this context requires constant processing and analysis of user interaction data during long-term learning activities, which produces huge amounts of valuable data, typically stored in server log files. Because the log files generated daily are large or very large, massive processing is a foremost step in extracting useful information. To this end, this work first studies the viability of processing large log data files of a real Virtual Campus using different distributed infrastructures. More precisely, we study the time performance of massive processing of daily log files implemented following the master-slave paradigm and evaluated using Cluster Computing and PlanetLab platforms. The study reveals the complexity and challenges of massive processing in the big data era, such as the need to carefully tune the size of the log data chunks processed at slave nodes, as well as the processing bottleneck in truly geographically distributed infrastructures caused by the communication overhead between the master and slave nodes. Then, an application of the massive processing approach, which results in log data processed and stored in a well-structured format, is presented. We show how to extract knowledge from the log data analysis by using the WEKA framework for data mining, showing its usefulness for building user models by identifying interesting navigation patterns of on-line learners. The study is motivated by and conducted on the actual data logs of the Virtual Campus of the Open University of Catalonia. Peer Reviewed. Postprint (author's final draft).
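The master-slave chunking described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: local threads stand in for the slave nodes, the `user page` line format is hypothetical, and `chunk_size` plays the role of the chunk-size tuning parameter the abstract mentions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Master step: cut the log into chunks of at most chunk_size lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def process_chunk(chunk):
    """Worker step: count requests per (user, page) in one chunk.

    Assumes a hypothetical whitespace-separated 'user page' line format.
    """
    counts = Counter()
    for line in chunk:
        parts = line.split()
        if len(parts) >= 2:  # skip malformed lines
            counts[(parts[0], parts[1])] += 1
    return counts


def process_log(lines, chunk_size=2, workers=4):
    """Dispatch chunks to workers and merge their partial counts."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(process_chunk, split_into_chunks(lines, chunk_size)):
            total.update(partial)
    return total


# Tiny made-up log, for illustration only.
sample_log = [
    "u1 /home",
    "u1 /forum",
    "u2 /home",
    "u1 /home",
    "malformed",
]
result = process_log(sample_log, chunk_size=2)
```

In a truly distributed setting, each chunk would also incur a transfer cost between master and worker, which is why the chunk size needs tuning: smaller chunks balance load better but multiply the communication overhead.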
Detection of Early-Stage Enterprise Infection by Mining Large-Scale Log Data
Recent years have seen the rise of more sophisticated attacks including
advanced persistent threats (APTs) which pose severe risks to organizations and
governments by targeting confidential proprietary information. Additionally,
new malware strains are appearing at a higher rate than ever before. Since many
of these strains are designed to evade existing security products, traditional
defenses deployed by most enterprises today, e.g., anti-virus, firewalls,
intrusion detection systems, often fail at detecting infections at an early
stage.
We address the problem of detecting early-stage infection in an enterprise
setting by proposing a new framework based on belief propagation inspired by
graph theory. Belief propagation can be used either with "seeds" of compromised
hosts or malicious domains (provided by the enterprise security operation
center -- SOC) or without any seeds. In the latter case we develop a detector
of C&C communication particularly tailored to enterprises which can detect a
stealthy compromise of only a single host communicating with the C&C server.
We demonstrate that our techniques perform well on detecting enterprise
infections. We achieve high accuracy with low false detection and false
negative rates on two months of anonymized DNS logs released by Los Alamos
National Lab (LANL), which include APT infection attacks simulated by LANL
domain experts. We also apply our algorithms to 38TB of real-world web proxy
logs collected at the border of a large enterprise. Through careful manual
investigation in collaboration with the enterprise SOC, we show that our
techniques identified hundreds of malicious domains overlooked by
state-of-the-art security products.
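The seeded mode described in this abstract can be approximated by a much simpler label propagation over a host-domain graph. The sketch below is an assumption-laden stand-in, not the paper's belief propagation: it uses a hard rarity threshold (the hypothetical `max_hosts_per_domain`) in place of the authors' scoring, and all host and domain names are made up.

```python
def propagate_from_seeds(connections, seed_hosts, max_hosts_per_domain=2, max_rounds=10):
    """Spread 'compromised' labels from seed hosts across a host-domain graph.

    connections: iterable of (host, domain) pairs, e.g. parsed from DNS logs.
    A domain is flagged only if a flagged host contacted it AND it is rare,
    i.e. contacted by at most max_hosts_per_domain hosts (assumed heuristic).
    """
    hosts_of = {}
    domains_of = {}
    for host, domain in connections:
        hosts_of.setdefault(domain, set()).add(host)
        domains_of.setdefault(host, set()).add(domain)

    bad_hosts = set(seed_hosts)
    bad_domains = set()
    for _ in range(max_rounds):
        new_domains = {
            d
            for h in bad_hosts
            for d in domains_of.get(h, ())
            if d not in bad_domains and len(hosts_of[d]) <= max_hosts_per_domain
        }
        if not new_domains:
            break  # nothing changed, propagation has converged
        bad_domains |= new_domains
        for d in new_domains:
            bad_hosts |= hosts_of[d]  # every host contacting a flagged domain
    return bad_hosts, bad_domains


# Made-up connections, for illustration only.
connections = [
    ("h1", "evil.example"),
    ("h2", "evil.example"),
    ("h1", "popular.example"),
    ("h2", "popular.example"),
    ("h3", "popular.example"),
    ("h2", "drop.example"),
    ("h4", "news.example"),
]
hosts, domains = propagate_from_seeds(connections, seed_hosts={"h1"})
```

Here the popular domain is never flagged because three hosts contact it, so `h3` stays clean even though it shares that domain with the seed host.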
Bidirectional Growth based Mining and Cyclic Behaviour Analysis of Web Sequential Patterns
Web sequential patterns are important for analyzing and understanding users'
behaviour to improve the quality of service offered by the World Wide Web. Web
Prefetching is one such technique that utilizes prefetching rules derived
through Cyclic Model Analysis of the mined Web sequential patterns. Prediction
becomes more accurate and prefetching results more satisfying if we
use a highly efficient and scalable mining technique such as the Bidirectional
Growth based Directed Acyclic Graph. In this paper, we propose a novel
algorithm called Bidirectional Growth based mining Cyclic behavior Analysis of
web sequential Patterns (BGCAP) that effectively combines these strategies to
generate prefetching rules in the form of 2-sequence patterns with Periodicity
and threshold of Cyclic Behaviour that can be utilized to effectively prefetch
Web pages, thus reducing the user-perceived latency. As BGCAP is based on
Bidirectional pattern growth, it performs only (log n+1) levels of recursion
for mining n Web sequential patterns. Our experimental results show that
generating prefetching rules with BGCAP is 5-10 percent faster for different
data sizes and 10-15 percent faster for a fixed data size than with TD-Mine. In
addition, BGCAP generates about 5-15 percent more prefetching rules than TD-Mine.
Comment: 19 pages
Discovering Beaten Paths in Collaborative Ontology-Engineering Projects using Markov Chains
Biomedical taxonomies, thesauri and ontologies, such as the
International Classification of Diseases (ICD) as a taxonomy or the National
Cancer Institute Thesaurus as an OWL-based ontology, play a critical role in
acquiring, representing and processing information about human health. With
increasing adoption and relevance, biomedical ontologies have also
significantly increased in size. For example, the 11th revision of the ICD,
which is currently under active development by the WHO, contains nearly 50,000
classes representing a vast variety of different diseases and causes of death.
This evolution in terms of size was accompanied by an evolution in the way
ontologies are engineered. Because no single individual has the expertise to
develop such large-scale ontologies, ontology-engineering projects have evolved
from small-scale efforts involving just a few domain experts to large-scale
projects that require effective collaboration between dozens or even hundreds
of experts, practitioners and other stakeholders. Understanding how these
stakeholders collaborate will enable us to improve editing environments that
support such collaborations. We uncover how large ontology-engineering
projects, such as the ICD in its 11th revision, unfold by analyzing usage logs
of five different biomedical ontology-engineering projects of varying sizes and
scopes using Markov chains. We discover intriguing interaction patterns (e.g.,
which properties users subsequently change) that suggest that large
collaborative ontology-engineering projects are governed by a few general
principles that determine and drive development. From our analysis, we identify
commonalities and differences between different projects that have implications
for project managers, ontology editors, developers and contributors working on
collaborative ontology-engineering projects and tools in the biomedical domain.
Comment: Published in the Journal of Biomedical Informatics
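The Markov-chain analysis of usage logs mentioned in this abstract rests on estimating how often one action follows another. A minimal first-order sketch, assuming each session is a list of edited property names (the names below are hypothetical and not taken from the paper's data):

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities.

    sessions: list of action sequences, one per user session.
    Returns {state: {next_state: probability}} using relative frequencies.
    """
    counts = defaultdict(Counter)
    for actions in sessions:
        # Pair each action with the one that immediately follows it.
        for current, following in zip(actions, actions[1:]):
            counts[current][following] += 1
    return {
        state: {nxt: n / sum(successors.values()) for nxt, n in successors.items()}
        for state, successors in counts.items()
    }


# Made-up sessions, for illustration only.
sessions = [
    ["title", "definition", "synonym"],
    ["title", "definition", "definition"],
    ["title", "synonym"],
]
probs = transition_probabilities(sessions)
```

In this toy data, editing `title` is followed by `definition` in two of three sessions, the kind of regularity the abstract describes as an interaction pattern.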
Implementing a Decision-Aware System for Loan Contracting Decision Process
The paper introduces our work on the design and implementation of a decision-aware system focused on the loan contracting decision process. A decision-aware system is software that enables the user to make a decision in a simulated environment and logs all the actions of the decision maker while interacting with the software. By applying a mining algorithm to the logs, it creates a model of the decision process and presents it to the user. The main design issue introduced in the paper is the possibility of logging the mental actions of the user. The main implementation issues are the programming of user activity logging and the technologies used. The first section of the paper introduces the state-of-the-art research in process mining and the framework of our research; the second section discusses the design of the system; the third section introduces the actual implementation; and the fourth section shows a running example. Keywords: Decision-Aware Systems, Decision Activity Logs, Decision Mining, Codeigniter, JSON