A Survey on Explainable Anomaly Detection
In the past two decades, most research on anomaly detection has focused on
improving the accuracy of the detection, while largely ignoring the
explainability of the corresponding methods and thus leaving the explanation of
outcomes to practitioners. As anomaly detection algorithms are increasingly
used in safety-critical domains, providing explanations for the high-stakes
decisions made in those domains has become an ethical and regulatory
requirement. Therefore, this work provides a comprehensive and structured
survey on state-of-the-art explainable anomaly detection techniques. We propose
a taxonomy based on the main aspects that characterize each explainable anomaly
detection technique, aiming to help practitioners and researchers find the
explainable anomaly detection method that best suits their needs.
Comment: Paper accepted by ACM Transactions on Knowledge Discovery from Data (TKDD) for publication (preprint version).
Big data analytics for preventive medicine
© 2019, Springer-Verlag London Ltd., part of Springer Nature. Medical data is among the most rewarding and yet most complicated data to analyze. How can healthcare providers use modern data analytics tools and technologies to analyze and create value from complex data? Data analytics promises to efficiently discover valuable patterns by analyzing large amounts of unstructured, heterogeneous, non-standard, and incomplete healthcare data. It not only forecasts but also supports decision making, and it is increasingly seen as a breakthrough in the ongoing effort to improve the quality of patient care and reduce healthcare costs. The aim of this study is to provide a comprehensive and structured overview of extensive research on the advancement of data analytics methods for disease prevention. This review first introduces disease prevention and its challenges, followed by traditional prevention methodologies. We summarize state-of-the-art data analytics algorithms used for disease classification, clustering (unusually high incidence of a particular disease), anomaly detection (detection of disease), and association, together with their respective advantages, drawbacks, and guidelines for selecting a specific model, followed by a discussion of recent developments and successful applications of disease prevention methods. The article concludes with open research challenges and recommendations.
A survey on explainable anomaly detection
NWO: Algorithms and the Foundations of Software Technology
New scalable machine learning methods: beyond classification and regression
Programa Oficial de Doutoramento en Computación. 5009V01
[Abstract]
The recent surge in available data has spawned a new and promising age of machine
learning. Success cases of machine learning are arriving at an increasing rate, as some
algorithms are able to leverage immense amounts of data to produce difficult and highly
accurate predictions. Still, many algorithms in the toolbox of the machine learning practitioner
have been rendered useless in this new scenario due to the complications associated with
large-scale learning. Handling large datasets entails logistical problems, limits the computational
and spatial complexity of the algorithms used, favours methods with few or
no hyperparameters to be configured, and exhibits specific characteristics that complicate
learning. This thesis is centered on the scalability of machine learning algorithms,
that is, their capacity to maintain their effectiveness as the scale of the data grows, and
on how that scalability can be improved. We focus on problems for which the existing solutions struggle
when the scale grows. Therefore, we skip classification and regression problems and
focus on feature selection, anomaly detection, graph construction and explainable machine
learning. We analyze four different strategies to obtain scalable algorithms. First,
we explore distributed computation, which is used in all of the presented algorithms.
Besides this technique, we also examine the use of approximate models to speed up
computations, the design of new models that take advantage of a characteristic of the
input data to simplify training, and the enhancement of simple models to enable them
to manage large-scale learning. We have implemented four new algorithms and six
versions of existing ones that tackle the mentioned problems, and for each one we report
experimental results that show both their validity in comparison with competing
methods and their capacity to scale to large datasets. All the presented algorithms
have been made available for download and are being published in journals to enable
practitioners and researchers to use them.
Proceedings of the 1st Doctoral Consortium at the European Conference on Artificial Intelligence (DC-ECAI 2020)
1st Doctoral Consortium at the European Conference on
Artificial Intelligence (DC-ECAI 2020), 29-30 August, 2020
Santiago de Compostela, Spain. The DC-ECAI 2020 provides a unique opportunity for PhD students who are close to finishing their doctoral research to interact with experienced researchers in the field. Senior members of the community are assigned as mentors for each group of students based on the students' research topics or similarity of research interests. The DC-ECAI 2020, which is held virtually this year, allows students from all over the world to present and discuss their ongoing research and career plans with their mentor, to network with other participants, and to receive training and mentoring about career planning and career options.
End-to-end anomaly detection in stream data
Nowadays, huge volumes of data are generated with increasing velocity by various systems, applications, and activities. This increases the demand for stream and time series analysis to react to changing conditions in real time, for enhanced efficiency and quality of service delivery as well as upgraded safety and security in the private and public sectors. Despite its very rich history, time series anomaly detection is still one of the vital topics in machine learning research and is receiving increasing attention. Identifying hidden patterns and selecting a model that fits the observed data well, and that also carries over to unobserved data, is not a trivial task. Due to the increasing diversity of data sources and associated stochastic processes, this pivotal data analysis topic is loaded with challenges such as complex latent patterns, concept drift, and overfitting that may mislead a model and cause a high false alarm rate. Handling these challenges leads advanced anomaly detection methods to develop sophisticated decision logic, which turns them into opaque and inexplicable black boxes. Contrary to this trend, end users expect transparency and verifiability in order to trust a model and the outcomes it produces. Also, pointing users to the most anomalous or malicious regions of a time series and to the causal features could save them time, energy, and money. For these reasons, this thesis addresses the crucial challenges in an end-to-end pipeline for stream-based anomaly detection through three essential phases: behavior prediction, inference, and interpretation. The first step is focused on devising a time series model that achieves high average accuracy as well as small error deviation. On this basis, we propose higher-quality anomaly detection and scoring techniques that utilize the related contexts to reclassify observations and post-prune unjustified events.
Last but not least, we make the predictive process transparent and verifiable by providing meaningful reasoning behind its generated results, based on concepts understandable to a human. The provided insight can pinpoint the anomalous regions of a time series and explain why the current status of a system has been flagged as anomalous. Stream-based anomaly detection research is a principal area of innovation to support our economy, security, and even the safety and health of societies worldwide. We believe our proposed analysis techniques can contribute to building a situational awareness platform and open new perspectives in a variety of domains such as cybersecurity and health care.
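The contextual scoring described above builds on stream statistics. As a purely illustrative sketch of the underlying idea, not the thesis's actual method, a rolling z-score detector flags points that deviate sharply from a trailing window; the window and threshold values here are arbitrary:

```python
from collections import deque
from math import sqrt

def rolling_zscore_flags(stream, window=30, threshold=3.0):
    """Flag points whose z-score against a trailing window exceeds threshold."""
    buf = deque(maxlen=window)
    flags = []
    for x in stream:
        if len(buf) == window:
            mean = sum(buf) / window
            var = sum((v - mean) ** 2 for v in buf) / window
            std = sqrt(var) if var > 0 else 1e-9  # guard against a flat window
            flags.append(abs(x - mean) / std > threshold)
        else:
            flags.append(False)  # not enough history yet
        buf.append(x)
    return flags

# a flat signal with one injected spike at index 50
data = [10.0] * 50 + [25.0] + [10.0] * 20
flags = rolling_zscore_flags(data)
print(flags.index(True))  # -> 50
```

A real stream detector must additionally handle concept drift (e.g. via decaying statistics) and seasonality, which this sketch ignores.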
Anomaly detection and explanation in big data
Spring 2021. Includes bibliographical references. Data quality tests are used to validate the data stored in databases and data warehouses, and to detect violations of syntactic and semantic constraints. Domain experts grapple with the issues related to capturing all the important constraints and checking that they are satisfied. The constraints are often identified in an ad hoc manner based on knowledge of the application domain and the needs of the stakeholders. Constraints can exist over single or multiple attributes as well as over records involving time series and sequences. Constraints involving multiple attributes can capture both linear and non-linear relationships among the attributes. We propose ADQuaTe, a data quality test framework that automatically (1) discovers different types of constraints from the data, (2) marks records that violate the constraints as suspicious, and (3) explains the violations. Domain knowledge is required to determine whether or not the suspicious records are actually faulty. The framework can incorporate feedback from domain experts to improve the accuracy of constraint discovery and anomaly detection. We instantiate ADQuaTe in two ways to detect anomalies in non-sequence and sequence data. The first instantiation (ADQuaTe2) uses an unsupervised approach, an autoencoder, for constraint discovery in non-sequence data. ADQuaTe2 analyzes records in isolation to discover constraints among the attributes. We evaluate the effectiveness of ADQuaTe2 using real-world non-sequence datasets from the human health and plant diagnosis domains. We demonstrate that ADQuaTe2 can discover new constraints that were previously unspecified in existing data quality tests, and can report both previously detected and new faults in the data.
We also use non-sequence datasets from the UCI repository to evaluate the improvement in the accuracy of ADQuaTe2 after incorporating ground-truth knowledge and retraining the autoencoder model. The second instantiation (IDEAL) uses an unsupervised LSTM-autoencoder for constraint discovery in sequence data. IDEAL analyzes the correlations and dependencies among data records to discover constraints. We evaluate the effectiveness of IDEAL using datasets from Yahoo servers, NASA Shuttle, and the Colorado State University Energy Institute. We demonstrate that IDEAL can detect previously known anomalies in these datasets. Using mutation analysis, we show that IDEAL can detect different types of injected faults. We also demonstrate that the accuracy of the approach improves after incorporating ground-truth knowledge about the injected faults and retraining the LSTM-autoencoder model. The novelty of this research lies in the development of a domain-independent framework that effectively and efficiently discovers different types of constraints from the data, detects and explains anomalous data, and minimizes false alarms through an interactive learning process.
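ADQuaTe2's constraint discovery uses a neural autoencoder; as a rough sketch of the same reconstruction-error principle, rather than the framework's implementation, the example below uses a linear projection (the subspace a linear autoencoder with a one-unit bottleneck would also learn) and marks the record with unusually high reconstruction error as suspicious. All data, dimensions, and thresholds are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)

# normal records lie near a 1-D subspace of R^3; the anomaly does not
t = rng.normal(size=(200, 1))
normal = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
anomaly = np.array([[5.0, -5.0, 5.0]])           # violates the implicit constraint
X = np.vstack([normal, anomaly])

# "encode/decode" through the top principal direction of the normal data
mu = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mu, full_matrices=False)
W = vt[:1]                                       # 3 -> 1 bottleneck
recon = (X - mu) @ W.T @ W + mu
err = np.linalg.norm(X - recon, axis=1)          # reconstruction error per record

# records with error far above the normal range are flagged as suspicious
threshold = err[:-1].mean() + 5 * err[:-1].std()
print(np.nonzero(err > threshold)[0])
```

In the framework described above, a domain expert would then judge whether the flagged record is actually faulty, and that feedback would be used for retraining.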
Unsupervised learning for anomaly detection in Australian medical payment data
Fraudulent or wasteful medical insurance claims made by health care providers are costly for insurers. Typically, OECD healthcare organisations lose 3-8% of total expenditure to fraud. Given the scale of annual expenditure by Medicare Australia, Australia's universal public health insurer, losses of A$1–2.7 billion could be expected at those rates. However, fewer than 1% of claims to Medicare Australia are detected as fraudulent, below international benchmarks.
Variation is common in medicine, and health conditions, along with their presentation and treatment, are heterogenous by nature. Increasing volumes of data and rapidly changing patterns bring challenges which require novel solutions. Machine learning and data mining are becoming commonplace in this field, but no gold standard is yet available.
In this project, requirements are developed for real-world application to compliance analytics at the Australian Government Department of Health and Aged Care (DoH), covering: unsupervised learning; problem generalisation; human interpretability; context discovery; and cost prediction. Three novel methods are presented which rank providers by potentially recoverable costs. These methods use association analysis, topic modelling, and sequential pattern mining to provide interpretable, expert-editable models of typical provider claims. Anomalous providers are identified through comparison to the typical models, using metrics based on the costs of excess or upgraded services. Domain knowledge is incorporated in a machine-friendly way in two of the methods through the use of the MBS as an ontology. Validation by subject-matter experts and comparison to existing techniques show that the methods perform well. The methods are implemented in a software framework which enables rapid prototyping and quality assurance. The code is deployed at the DoH, and further applications as decision-support systems are in progress. The developed requirements will apply to future work in this field.
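As a toy illustration of ranking providers by potentially recoverable costs, and a drastic simplification of it (the methods above compare against association-rule, topic-model, and sequential-pattern baselines, not a peer median), with entirely hypothetical claim counts and fee:

```python
from statistics import median

# hypothetical per-provider claim counts for a single service item
claims = {
    "provider_a": 40, "provider_b": 45, "provider_c": 38,
    "provider_d": 190,  # claims far in excess of peers
    "provider_e": 42,
}
fee = 75.50  # assumed schedule fee per service, in dollars

# excess services relative to a typical-behaviour baseline, costed out
baseline = median(claims.values())
recoverable = {p: max(0, n - baseline) * fee for p, n in claims.items()}

# rank providers by potentially recoverable cost, highest first
ranking = sorted(recoverable, key=recoverable.get, reverse=True)
print(ranking[0])  # -> provider_d
```

The interpretable part of the real methods lies in the baseline models themselves, which experts can inspect and edit; the ranking step shown here is the simple final comparison.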
Bioinformatics Applications Based On Machine Learning
The great advances in information technology (IT) have implications for many sectors, such as bioinformatics, and have considerably increased their possibilities. This book presents a collection of 11 original research papers, all related to the application of IT techniques within the bioinformatics sector: from new applications created by adapting and applying existing techniques, to the creation of new methodologies to solve existing problems.
A Machine Learning Approach to Indoor Localization Data Mining
Indoor positioning systems are increasingly commonplace in various environments and
produce large quantities of data. They are used in industrial applications, robotics, and
asset and employee tracking, to name a few use cases. The growing amount of data
and the accelerating progress of machine learning open up many new possibilities for
analyzing this data in ways that were not conceivable or relevant before. This thesis
introduces connected concepts and implementations to answer the question of how this data
can be utilized. The data gathered in this thesis originates from an indoor positioning system
deployed in a retail environment, but the discussed methods can be applied generally.
The issue is approached by first introducing the concepts of machine learning
and, more generally, artificial intelligence, and how they work at a general level. A
deeper dive is then taken into the subfields and algorithms that are relevant to the data mining task
at hand. Indoor positioning system basics are also briefly discussed to build an understanding
of the realistic capabilities and constraints that these kinds of systems entail.
These methods and prior knowledge from the literature are put to the test with the
freshly gathered data. An algorithm based on an existing example from the literature was tested
and improved upon with the new data. A novel method to cluster and classify movement
patterns is introduced, utilizing deep learning to create embedded representations of the
trajectories in a more complex learning pipeline. This type of learning is often referred
to as deep clustering.
The results are promising, and both methods produce useful high-level representations
of the complex dataset that can help a human operator discern the
relevant patterns from raw data and serve as input for subsequent supervised and
unsupervised learning steps. Several factors related to optimizing the learning pipeline,
such as regularization, were also researched, and the results are presented as visualizations.
The research found that a pipeline consisting of a CNN-autoencoder followed by a classic
clustering algorithm such as DBSCAN produces useful results in the form of trajectory
clusters. Regularization, such as an L1 penalty, improves this performance.
The research done in this thesis presents useful algorithms for processing raw, noisy
localization data from indoor environments that can be used for further implementations
in both industrial applications and academia.
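The final stage of the CNN-autoencoder-plus-DBSCAN pipeline described above can be sketched in isolation. Below is a minimal, self-contained DBSCAN over hypothetical 2-D vectors standing in for learned trajectory embeddings; the eps and min_pts values are illustrative:

```python
import math

def dbscan(points, eps=1.0, min_pts=3):
    """Label each point with a cluster id, or -1 for noise."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # precompute eps-neighbourhoods (each point neighbours itself)
    neighbors = [[j for j in range(n) if dist(i, j) <= eps] for i in range(n)]
    labels = [None] * n
    cid = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1          # noise (may later become a border point)
            continue
        labels[i] = cid             # new core point starts a cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid     # border point absorbed into the cluster
            if labels[j] is not None:
                continue
            labels[j] = cid
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand through core points only
        cid += 1
    return labels

# two tight groups of "trajectory embeddings" plus one outlier
emb = [(0, 0), (0.2, 0.1), (0.1, 0.3),
       (5, 5), (5.2, 5.1), (5.1, 4.9),
       (10, -10)]
labels = dbscan(emb, eps=1.0, min_pts=3)
print(labels)  # -> [0, 0, 0, 1, 1, 1, -1]
```

In the pipeline above, the input vectors would come from the CNN-autoencoder's bottleneck layer rather than being hand-specified, and the noise label (-1) is what marks a trajectory as not matching any common movement pattern.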