Evaluation of Trace Alignment Quality and its Application in Medical Process Mining
Trace alignment algorithms have been used in process mining for discovering
the consensus treatment procedures and process deviations. Different alignment
algorithms, however, may produce very different results. No widely-adopted
method exists for evaluating the results of trace alignment. Existing
reference-free evaluation methods cannot adequately and comprehensively assess
the alignment quality. We analyzed and compared the existing evaluation
methods, identifying their limitations, and introduced improvements in two
reference-free evaluation methods. Our approach assesses the alignment result
globally instead of locally, and therefore helps the algorithm to optimize
overall alignment quality. We also introduced a novel metric to measure the
alignment complexity, which can be used as a constraint on alignment algorithm
optimization. We tested our evaluation methods on a trauma resuscitation
dataset and provided a medical explanation of the activities and patterns
identified as deviations using our proposed evaluation methods.
Comment: 10 pages, 6 figures, and 5 tables
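The abstract does not give its evaluation formulas, but the "global rather than local" idea can be illustrated with a simple reference-free score: average column conservation over a set of aligned traces. The function name, the list-of-lists representation, and the `-` gap symbol are assumptions for illustration, not the paper's actual metrics.

```python
from collections import Counter

def column_conservation(alignment):
    """Global, reference-free alignment score: for each column, the
    fraction of traces agreeing on the most common non-gap activity,
    averaged over all columns (1.0 = perfectly conserved)."""
    n_cols = len(alignment[0])
    scores = []
    for c in range(n_cols):
        col = [trace[c] for trace in alignment if trace[c] != "-"]
        if not col:
            continue  # all-gap column contributes nothing
        top = Counter(col).most_common(1)[0][1]
        scores.append(top / len(alignment))
    return sum(scores) / len(scores)

aligned = [
    ["A", "B", "-", "C"],
    ["A", "B", "D", "C"],
    ["A", "-", "D", "C"],
]
print(round(column_conservation(aligned), 3))  # → 0.833
```

Because the score is averaged over every column at once, optimizing it trades off all traces jointly rather than improving one pairwise alignment at the expense of others.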
Identification of hot regions in protein-protein interactions by sequential pattern mining
Background: Identification of protein interacting sites is an important task in computational molecular biology. As more and more protein sequences are deposited without available structural information, it is strongly desirable to predict protein binding regions from their sequences alone. This paper presents a pattern mining approach to tackle this problem. It is observed that a functional region of protein structures usually consists of several peptide segments linked with large wildcard regions. Thus, the proposed mining technique considers large irregular gaps when growing patterns, in order to find residues that are simultaneously conserved but largely separated on the sequences. A derived pattern is called a cluster-like pattern, since the discovered conserved residues are always grouped into several blocks, each of which corresponds to a local conserved region on the protein sequence.
Results: The experiments conducted in this work demonstrate that the derived long patterns automatically discover the important residues that form one or several hot regions of protein-protein interactions. The methodology is evaluated by conducting experiments on the web server MAGIIC-PRO, based on a well-known benchmark containing 220 protein chains from 72 distinct complexes. Among the 218 proteins tested, 900 sequential blocks were discovered, 4.25 blocks per protein chain on average. About 92% of the derived blocks are observed to be clustered in space with at least one of the other blocks, and about 66% of the blocks are found near the interface of protein-protein interactions. In summary, for about 83% of the tested proteins, at least two interacting blocks can be discovered by this approach.
Conclusion: This work demonstrates that the important residues associated with the interface of protein-protein interactions may be automatically discovered by sequential pattern mining. The detected regions possess high conservation and are thus considered computational hot regions. This information would be useful for characterizing protein sequences, predicting protein function, finding potential partners, and facilitating protein docking for drug discovery.
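As a toy illustration of the gap-constrained matching the abstract describes (conserved blocks separated by large wildcard regions), the sketch below checks whether an ordered list of blocks occurs in a sequence with at most `max_gap` wildcard residues between consecutive blocks. The function, its greedy leftmost-match strategy, and the example sequence are invented for illustration; MAGIIC-PRO's actual pattern-growth algorithm is more involved.

```python
def matches(sequence, blocks, max_gap):
    """Check whether a cluster-like pattern — conserved blocks separated
    by wildcard gaps of at most max_gap residues — occurs in order.
    Greedy leftmost matching; sufficient for illustration."""
    pos = 0
    for i, block in enumerate(blocks):
        # First block may start anywhere; later blocks must begin
        # within max_gap residues of the previous block's end.
        window_end = len(sequence) if i == 0 else min(
            len(sequence), pos + max_gap + len(block))
        idx = sequence.find(block, pos, window_end)
        if idx == -1:
            return False
        pos = idx + len(block)
    return True

print(matches("MKTAYIAKQRQISFVK", ["KTA", "QRQ", "FVK"], max_gap=6))  # → True
print(matches("MKTAYIAKQRQISFVK", ["KTA", "QRQ", "FVK"], max_gap=1))  # → False
```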
Pattern Discovery in Colored Strings
In this paper, we consider the problem of identifying patterns of interest in
colored strings. A colored string is a string where each position is assigned
one of a finite set of colors. Our task is to find substrings of the colored
string that always occur followed by the same color at the same distance. The
problem is motivated by applications in embedded systems verification, in
particular, assertion mining. The goal there is to automatically find
properties of the embedded system from the analysis of its simulation traces.
We show that, in our setting, the number of patterns of interest is
upper-bounded by , where is the length of the string. We
introduce a baseline algorithm, running in time, which
identifies all patterns of interest satisfying certain minimality conditions,
for all colors in the string. For the case where one is interested in patterns
related to one color only, we also provide a second algorithm which runs in
time in the worst case but is faster than the baseline
algorithm in practice. Both solutions use suffix trees, and the second
algorithm also uses an appropriately defined priority queue, which allows us to
reduce the number of computations. We performed an experimental evaluation of
the proposed approaches over both synthetic and real-world datasets, and found
that the second algorithm outperforms the first algorithm on all simulated
data, while on the real-world data, the performance varies between a slight
slowdown (on half of the datasets) and a speedup by a factor of up to 11.
Comment: 22 pages, 5 figures, 2 tables, published in ACM Journal of Experimental Algorithmics. This is the journal version of the paper with the same title at SEA 2020 (18th Symposium on Experimental Algorithms, Catania, Italy, June 16-18, 2020).
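For a fixed distance `d`, the property in question — substrings whose every occurrence is followed by the same color at distance `d` — can be checked by brute force. The cubic-time sketch below is only a naive illustration (names and parameters are mine); the paper's algorithms achieve far better bounds using suffix trees and additionally enforce the minimality conditions it mentions.

```python
from collections import defaultdict

def consistent_patterns(s, colors, d, min_len=2, min_occ=2):
    """Find substrings of s (length >= min_len, occurring >= min_occ
    times) that are always followed by the same color at distance d.
    colors[i] is the color assigned to position i of s."""
    seen = defaultdict(set)  # substring -> colors seen d past its end
    occ = defaultdict(int)   # substring -> occurrences with end + d in range
    n = len(s)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            end = j - 1
            if end + d < n:
                seen[s[i:j]].add(colors[end + d])
                occ[s[i:j]] += 1
    # Keep patterns whose observed color set is a singleton.
    return {p: next(iter(c)) for p, c in seen.items()
            if len(c) == 1 and occ[p] >= min_occ}

print(consistent_patterns("abcabc", "xyzxyz", d=1))  # → {'ab': 'z'}
```

In the assertion-mining setting, such a pattern would read as a candidate property: "whenever this event subsequence occurs, color `z` follows one step later."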
Unraveling the "Anomaly" in Time Series Anomaly Detection: A Self-supervised Tri-domain Solution
The ongoing challenges in time series anomaly detection (TSAD), notably the
scarcity of anomaly labels and the variability in anomaly lengths and shapes,
have led to the need for a more efficient solution. As limited anomaly labels
hinder traditional supervised models in TSAD, various SOTA deep learning
techniques, such as self-supervised learning, have been introduced to tackle
this issue. However, they encounter difficulties handling variations in anomaly
lengths and shapes, limiting their adaptability to diverse anomalies.
Additionally, many benchmark datasets suffer from containing explicit
anomalies that even random functions can detect. This problem is exacerbated
by an ill-posed evaluation metric known as point adjustment (PA), which can
inflate reported model performance. In this context, we propose a
novel self-supervised learning based Tri-domain Anomaly Detector (TriAD), which
addresses these challenges by modeling features across three data domains -
temporal, frequency, and residual domains - without relying on anomaly labels.
Unlike traditional contrastive learning methods, TriAD employs both
inter-domain and intra-domain contrastive loss to learn common attributes among
normal data and differentiate them from anomalies. Additionally, our approach
can detect anomalies of varying lengths by integrating with a discord discovery
algorithm. It is worth noting that this study is the first to reevaluate the
deep learning potential in TSAD, utilizing both rigorously designed datasets
(i.e., UCR Archive) and evaluation metrics (i.e., PA%K and affiliation).
Through experimental results on the UCR dataset, TriAD achieves an impressive
three-fold increase in PA%K-based F1 scores over SOTA deep learning models, and
a 50% increase in accuracy compared to SOTA discord discovery algorithms.
Comment: This work is submitted to IEEE International Conference on Data Engineering (ICDE) 202
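The PA%K protocol the abstract references can be sketched as follows: point adjustment is granted to a ground-truth anomaly segment only when at least a fraction K of its points are flagged, so a single lucky hit no longer marks the whole segment as detected. The function name and interface below are assumptions for illustration, not the paper's code.

```python
def pa_at_k(labels, preds, k=0.2):
    """PA%K adjustment (sketch): within each ground-truth anomaly
    segment (labels == 1), mark the whole segment as detected only if
    at least a fraction k of its points were predicted anomalous;
    otherwise leave the predictions unchanged."""
    adj = list(preds)
    i, n = 0, len(labels)
    while i < n:
        if labels[i] == 1:
            j = i
            while j < n and labels[j] == 1:
                j += 1  # j is one past the segment's end
            seg = preds[i:j]
            if sum(seg) / len(seg) >= k:
                adj[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adj

print(pa_at_k([0, 1, 1, 1, 1, 0], [0, 0, 1, 0, 0, 0], k=0.2))  # → [0, 1, 1, 1, 1, 0]
print(pa_at_k([0, 1, 1, 1, 1, 0], [0, 0, 1, 0, 0, 0], k=0.5))  # → [0, 0, 1, 0, 0, 0]
```

Sweeping K from 0 to 1 interpolates between vanilla PA (K = 0, the inflation-prone case) and unadjusted point-wise scoring (K = 1), which is what makes PA%K-based F1 a more rigorous comparison.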
Unsupervised learning for anomaly detection in Australian medical payment data
Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3-8% of total expenditure to fraud. Applying these rates to the annual expenditure of Medicare Australia, Australia's universal public health insurer, losses of A$1–2.7 billion could be expected. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks.
Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogeneous by nature. Increasing volumes of data and rapidly changing patterns bring challenges which require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available.
In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods use association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on the costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques show that the methods perform well. The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is implemented at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field.
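The "potentially recoverable costs" ranking can be illustrated with a minimal excess-cost score: compare a provider's per-item service counts against a typical model's expected per-claim rates and sum the cost of the surplus. All names, item codes, rates, and fees below are invented for illustration; the thesis's methods build the typical models with association analysis, topic modelling, and sequential pattern mining rather than fixed rates.

```python
def recoverable_cost(provider_counts, typical_rates, item_costs, n_claims):
    """Illustrative ranking score: total cost of services a provider
    billed in excess of the typical per-claim rate, ignoring
    under-servicing (negative excess)."""
    score = 0.0
    for item, count in provider_counts.items():
        expected = typical_rates.get(item, 0.0) * n_claims
        excess = count - expected
        if excess > 0:
            score += excess * item_costs.get(item, 0.0)
    return score

# Hypothetical provider billing 100 claims (item codes and fees invented):
score = recoverable_cost(
    provider_counts={"itemA": 120, "itemB": 30},
    typical_rates={"itemA": 1.0, "itemB": 0.1},   # expected services per claim
    item_costs={"itemA": 39.75, "itemB": 75.05},  # fee per service
    n_claims=100,
)
print(round(score, 2))  # → 2296.0
```

Scoring only the positive excess keeps the metric interpretable to compliance staff as dollars at stake, which matches the cost-prediction and human-interpretability requirements listed above.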