2,342 research outputs found
Efficient mining of discriminative molecular fragments
Frequent pattern discovery in structured data is receiving
an increasing attention in many application areas of sciences. However, the computational complexity and the large amount of data to be explored often make the sequential algorithms unsuitable. In this context high performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimation is available for task workloads, which show a power-law distribution in a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well known National Cancer Instituteās HIV-screening dataset
High performance subgraph mining in molecular compounds
Structured data represented in the form of graphs arises in
several fields of the science and the growing amount of available data makes distributed graph mining techniques particularly relevant. In this paper, we present a distributed approach to the frequent subgraph mining
problem to discover interesting patterns in molecular compounds. The problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main
aspects of the proposed distributed algorithm, namely a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiver-initiated, load balancing
algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Instituteās HIV-screening dataset, where the approach attains close-to linear speedup in a network
of workstations
Active Learning of Discriminative Subgraph Patterns for API Misuse Detection
A common cause of bugs and vulnerabilities are the violations of usage
constraints associated with Application Programming Interfaces (APIs). API
misuses are common in software projects, and while there have been techniques
proposed to detect such misuses, studies have shown that they fail to reliably
detect misuses while reporting many false positives. One limitation of prior
work is the inability to reliably identify correct patterns of usage. Many
approaches confuse a usage pattern's frequency for correctness. Due to the
variety of alternative usage patterns that may be uncommon but correct, anomaly
detection-based techniques have limited success in identifying misuses. We
address these challenges and propose ALP (Actively Learned Patterns),
reformulating API misuse detection as a classification problem. After
representing programs as graphs, ALP mines discriminative subgraphs. While
still incorporating frequency information, through limited human supervision,
we reduce the reliance on the assumption relating frequency and correctness.
The principles of active learning are incorporated to shift human attention
away from the most frequent patterns. Instead, ALP samples informative and
representative examples while minimizing labeling effort. In our empirical
evaluation, ALP substantially outperforms prior approaches on both MUBench, an
API Misuse benchmark, and a new dataset that we constructed from real-world
software projects
CONTEXT-AWARE DEBUGGING FOR CONCURRENT PROGRAMS
Concurrency faults are difficult to reproduce and localize because they usually occur under specific inputs and thread interleavings. Most existing fault localization techniques focus on sequential programs but fail to identify faulty memory access patterns across threads, which are usually the root causes of concurrency faults. Moreover, existing techniques for sequential programs cannot be adapted to identify faulty paths in concurrent programs. While concurrency fault localization techniques have been proposed to analyze passing and failing executions obtained from running a set of test cases to identify faulty access patterns, they primarily focus on using statistical analysis. We present a novel approach to fault localization using feature selection techniques from machine learning. Our insight is that the concurrency access patterns obtained from a large volume of coverage data generally constitute high dimensional data sets, yet existing statistical analysis techniques for fault localization are usually applied to low dimensional data sets. Each additional failing or passing run can provide more diverse information, which can help localize faulty concurrency access patterns in code. The patterns with maximum feature diversity information can point to the most suspicious pattern. We then apply data mining technique and identify the interleaving patterns that are occurred most frequently and provide the possible faulty paths. We also evaluate the effectiveness of fault localization using test suites generated from different test adequacy criteria. We have evaluated Cadeco on 10 real-world multi-threaded Java applications. Results indicate that Cadeco outperforms state-of-the-art approaches for localizing concurrency faults
Dynamic load balancing for the distributed mining of molecular structures
In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of
methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the
past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially
render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to
discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no
reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic
partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiverinitiated
load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer
Instituteās HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed
approach also allows for dynamic resource aggregation in a non dedicated computational environment. These features make it suitable
for large-scale, multi-domain, heterogeneous environments, such as computational grids
Feature-based time-series analysis
This work presents an introduction to feature-based time-series analysis. The
time series as a data type is first described, along with an overview of the
interdisciplinary time-series analysis literature. I then summarize the range
of feature-based representations for time series that have been developed to
aid interpretable insights into time-series structure. Particular emphasis is
given to emerging research that facilitates wide comparison of feature-based
representations that allow us to understand the properties of a time-series
dataset that make it suited to a particular feature-based representation or
analysis algorithm. The future of time-series analysis is likely to embrace
approaches that exploit machine learning methods to partially automate human
learning to aid understanding of the complex dynamical patterns in the time
series we measure from the world.Comment: 28 pages, 9 figure
Mining behavior graphs for ābacktraceā of noncrashing bugs
Analyzing the executions of a buggy software program is essentially a data mining process. Although many interesting methods have been developed to trace crashing bugs (such as memory violation and core dumps), it is still difficult to analyze noncrashing bugs (such as logical errors). In this paper, we develop a novel method to classify the structured traces of program executions using software behavior graphs. By analyzing the correct and incorrect executions, we have made good progress at the isolation of program regions that may lead to the faulty executions. The classification framework is built on an integration of closed graph mining and SVM classification. More interestingly, suspicious regions are identified through the capture of the classification accuracy change, which is measured incrementally during program execution. Our performance study and case-based experiments show that our approach is both effective and efficient
Heterogeneous Anomaly Detection for Software Systems via Semi-supervised Cross-modal Attention
Prompt and accurate detection of system anomalies is essential to ensure the
reliability of software systems. Unlike manual efforts that exploit all
available run-time information, existing approaches usually leverage only a
single type of monitoring data (often logs or metrics) or fail to make
effective use of the joint information among different types of data.
Consequently, many false predictions occur. To better understand the
manifestations of system anomalies, we conduct a systematical study on a large
amount of heterogeneous data, i.e., logs and metrics. Our study demonstrates
that logs and metrics can manifest system anomalies collaboratively and
complementarily, and neither of them only is sufficient. Thus, integrating
heterogeneous data can help recover the complete picture of a system's health
status. In this context, we propose Hades, the first end-to-end semi-supervised
approach to effectively identify system anomalies based on heterogeneous data.
Our approach employs a hierarchical architecture to learn a global
representation of the system status by fusing log semantics and metric
patterns. It captures discriminative features and meaningful interactions from
heterogeneous data via a cross-modal attention module, trained in a
semi-supervised manner. We evaluate Hades extensively on large-scale simulated
data and datasets from Huawei Cloud. The experimental results present the
effectiveness of our model in detecting system anomalies. We also release the
code and the annotated dataset for replication and future research.Comment: In Proceedings of the 2023 IEEE/ACM 45th International Conference on
Software Engineering (ICSE). arXiv admin note: substantial text overlap with
arXiv:2207.0291
- ā¦