3,686 research outputs found
Heuristic Approaches for Generating Local Process Models through Log Projections
Local Process Model (LPM) discovery is focused on the mining of a set of
process models where each model describes the behavior represented in the event
log only partially, i.e. subsets of possible events are taken into account to
create so-called local process models. Often such smaller models provide
valuable insights into the behavior of the process, especially when no adequate
and comprehensible single overall process model exists that is able to describe
the traces of the process from start to end. The practical application of LPM
discovery is however hindered by computational issues in the case of logs with
many activities (problems may already occur when there are more than 17 unique
activities). In this paper, we explore three heuristics to discover subsets of
activities that lead to useful log projections with the goal of speeding up LPM
discovery considerably while still finding high-quality LPMs. We found that a
Markov clustering approach to create projection sets results in the largest
improvement of execution time, with discovered LPMs still being better than
with the use of randomly generated activity sets of the same size. Another
heuristic, based on log entropy, yields a more moderate speedup, but enables
the discovery of higher quality LPMs. The third heuristic, based on the
relative information gain, shows unstable performance: for some data sets the
speedup and LPM quality are higher than with the log entropy based method,
while for other data sets there is no speedup at all.Comment: paper accepted and to appear in the proceedings of the IEEE Symposium
on Computational Intelligence and Data Mining (CIDM), special session on
Process Mining, part of the Symposium Series on Computational Intelligence
(SSCI
A Machine Learning Enhanced Scheme for Intelligent Network Management
The versatile networking services bring about huge influence on daily living styles while the amount and diversity of services cause high complexity of network systems. The network scale and complexity grow with the increasing infrastructure apparatuses, networking function, networking slices, and underlying architecture evolution. The conventional way is manual administration to maintain the large and complex platform, which makes effective and insightful management troublesome. A feasible and promising scheme is to extract insightful information from largely produced network data. The goal of this thesis is to use learning-based algorithms inspired by machine learning communities to discover valuable knowledge from substantial network data, which directly promotes intelligent management and maintenance. In the thesis, the management and maintenance focus on two schemes: network anomalies detection and root causes localization; critical traffic resource control and optimization. Firstly, the abundant network data wrap up informative messages but its heterogeneity and perplexity make diagnosis challenging. For unstructured logs, abstract and formatted log templates are extracted to regulate log records. An in-depth analysis framework based on heterogeneous data is proposed in order to detect the occurrence of faults and anomalies. It employs representation learning methods to map unstructured data into numerical features, and fuses the extracted feature for network anomaly and fault detection. The representation learning makes use of word2vec-based embedding technologies for semantic expression. Next, the fault and anomaly detection solely unveils the occurrence of events while failing to figure out the root causes for useful administration so that the fault localization opens a gate to narrow down the source of systematic anomalies. The extracted features are formed as the anomaly degree coupled with an importance ranking method to highlight the locations of anomalies in network systems. Two types of ranking modes are instantiated by PageRank and operation errors for jointly highlighting latent issue of locations. Besides the fault and anomaly detection, network traffic engineering deals with network communication and computation resource to optimize data traffic transferring efficiency. Especially when network traffic are constrained with communication conditions, a pro-active path planning scheme is helpful for efficient traffic controlling actions. Then a learning-based traffic planning algorithm is proposed based on sequence-to-sequence model to discover hidden reasonable paths from abundant traffic history data over the Software Defined Network architecture. Finally, traffic engineering merely based on empirical data is likely to result in stale and sub-optimal solutions, even ending up with worse situations. A resilient mechanism is required to adapt network flows based on context into a dynamic environment. Thus, a reinforcement learning-based scheme is put forward for dynamic data forwarding considering network resource status, which explicitly presents a promising performance improvement. In the end, the proposed anomaly processing framework strengthens the analysis and diagnosis for network system administrators through synthesized fault detection and root cause localization. The learning-based traffic engineering stimulates networking flow management via experienced data and further shows a promising direction of flexible traffic adjustment for ever-changing environments
DeCaf: Diagnosing and Triaging Performance Issues in Large-Scale Cloud Services
Large scale cloud services use Key Performance Indicators (KPIs) for tracking
and monitoring performance. They usually have Service Level Objectives (SLOs)
baked into the customer agreements which are tied to these KPIs. Dependency
failures, code bugs, infrastructure failures, and other problems can cause
performance regressions. It is critical to minimize the time and manual effort
in diagnosing and triaging such issues to reduce customer impact. Large volume
of logs and mixed type of attributes (categorical, continuous) in the logs
makes diagnosis of regressions non-trivial.
In this paper, we present the design, implementation and experience from
building and deploying DeCaf, a system for automated diagnosis and triaging of
KPI issues using service logs. It uses machine learning along with pattern
mining to help service owners automatically root cause and triage performance
issues. We present the learnings and results from case studies on two large
scale cloud services in Microsoft where DeCaf successfully diagnosed 10 known
and 31 unknown issues. DeCaf also automatically triages the identified issues
by leveraging historical data. Our key insights are that for any such diagnosis
tool to be effective in practice, it should a) scale to large volumes of
service logs and attributes, b) support different types of KPIs and ranking
functions, c) be integrated into the DevOps processes.Comment: To be published in the proceedings of ICSE-SEIP '20, Seoul, Republic
of Kore
Enabling Richer Insight Into Runtime Executions Of Systems
Systems software of very large scales are being heavily used today in various important scenarios such as online retail, banking, content services, web search and social networks. As the scale of functionality and complexity grows in these software, managing the implementations becomes a considerable challenge for developers, designers and maintainers. Software needs to be constantly monitored and tuned for optimal efficiency and user satisfaction. With large scale, these systems incorporate significant degrees of asynchrony, parallelism and distributed executions, reducing the manageability of software including performance management. Adding to the complexity, developers are under pressure between developing new functionality for customers and maintaining existing programs. This dissertation argues that the manual effort currently required to manage performance of these systems is very high, and can be automated to both reduce the likelihood of problems and quickly fix them once identified. The execution logs from these systems are easily available and provide rich information about the internals at runtime for diagnosis purposes, but the volume of logs is simply too large for today\u27s techniques. Developers hence spend many human hours observing and investigating executions of their systems during development and diagnosis of software, for performance management. This dissertation proposes the application of machine learning techniques to automatically analyze logs from executions, to challenging tasks in different phases of the software lifecycle. It is shown that the careful application of statistical techniques to features extracted from instrumentation, can distill the rich log data into easily comprehensible forms for the developers
AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
Artificial Intelligence for IT operations (AIOps) aims to combine the power
of AI with the big data generated by IT Operations processes, particularly in
cloud infrastructures, to provide actionable insights with the primary goal of
maximizing availability. There are a wide variety of problems to address, and
multiple use-cases, where AI capabilities can be leveraged to enhance
operational efficiency. Here we provide a review of the AIOps vision, trends
challenges and opportunities, specifically focusing on the underlying AI
techniques. We discuss in depth the key types of data emitted by IT Operations
activities, the scale and challenges in analyzing them, and where they can be
helpful. We categorize the key AIOps tasks as - incident detection, failure
prediction, root cause analysis and automated actions. We discuss the problem
formulation for each task, and then present a taxonomy of techniques to solve
these problems. We also identify relatively under explored topics, especially
those that could significantly benefit from advances in AI literature. We also
provide insights into the trends in this field, and what are the key investment
opportunities
- …