68 research outputs found

    Towards Unsupervised Deep Learning Based Anomaly Detection

    Novelty or anomaly detection is a challenging problem in many research disciplines without a general solution. In machine learning, inputs unlike the training data need to be identified. In areas where research involves taking measurements, identifying errant measurements is often necessary and occasionally vital. When monitoring the status of a system, some observations may indicate that a system failure is occurring or may occur in the near future. The challenge is to identify the anomalous measurements, which are usually sparse in comparison to the valid measurements. This paper presents a land-water classification problem framed as an anomaly detection problem to demonstrate the inability of a classifier to detect anomalies. A second problem requiring the identification of anomalous data uses a deep neural network (DNN) to perform a nonlinear regression as a method for estimating the probability that a given input is valid rather than anomalous. Autoencoders are then discussed as an alternative to the supervised classification and regression approaches, in an effort to remove the need to represent the anomalies in the training dataset.
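    The abstract proposes autoencoders so that anomalies need not appear in the training set: the model is fit on normal data only, and inputs it cannot reconstruct are flagged. A minimal sketch of that idea, using a linear autoencoder (equivalent to PCA) and invented toy data rather than the paper's DNN, might look like:

    ```python
    import numpy as np

    def fit_linear_autoencoder(X, k):
        """Fit a linear autoencoder (equivalent to PCA) on normal data only."""
        mu = X.mean(axis=0)
        # Principal directions of the centered training data
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        W = Vt[:k].T          # shared encoder/decoder weights, shape (d, k)
        return mu, W

    def reconstruction_error(X, mu, W):
        """Squared error between inputs and their low-rank reconstructions."""
        Z = (X - mu) @ W      # encode
        Xh = Z @ W.T + mu     # decode
        return ((X - Xh) ** 2).sum(axis=1)

    rng = np.random.default_rng(0)
    # Toy "normal" data lying on a 2-D plane inside a 10-D space
    normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
    mu, W = fit_linear_autoencoder(normal, k=2)

    # Threshold from normal data alone; no anomalies needed for training
    threshold = np.percentile(reconstruction_error(normal, mu, W), 99)
    anomaly = rng.normal(size=(1, 10)) * 5.0   # a point off the plane
    is_anomaly = reconstruction_error(anomaly, mu, W)[0] > threshold
    ```

    A trained nonlinear autoencoder would replace the SVD step, but the decision rule is the same: large reconstruction error signals an input unlike the training data.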

    Restricted Generative Projection for One-Class Classification and Anomaly Detection

    We present a simple framework for one-class classification and anomaly detection. The core idea is to learn a mapping that transforms the unknown distribution of the training (normal) data into a known target distribution. Crucially, the target distribution should be sufficiently simple, compact, and informative. Simplicity ensures that we can sample from the distribution easily, compactness ensures that the decision boundary between normal and abnormal data is clear and reliable, and informativeness ensures that the transformed data preserve the important information of the original data. We therefore propose using a truncated Gaussian, a uniform distribution in a hypersphere, on a hypersphere, or between hyperspheres as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Comparative studies on multiple benchmark datasets verify the effectiveness of our methods in comparison to baselines.
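    A requirement of the framework above is that the target distribution be easy to sample from. A short sketch of two of the named targets (uniform on a hypersphere and uniform in a hypersphere) is shown below; the learned encoder that maps normal data onto such a target, and the distance loss used to train it, are not shown here:

    ```python
    import numpy as np

    def sample_uniform_on_hypersphere(n, d, rng):
        """Uniform on the unit (d-1)-sphere: normalize isotropic Gaussians."""
        g = rng.normal(size=(n, d))
        return g / np.linalg.norm(g, axis=1, keepdims=True)

    def sample_uniform_in_hypersphere(n, d, rng):
        """Uniform in the unit d-ball: sphere direction times radius^(1/d)."""
        directions = sample_uniform_on_hypersphere(n, d, rng)
        radii = rng.random(n) ** (1.0 / d)   # corrects for volume growth with r
        return directions * radii[:, None]

    rng = np.random.default_rng(0)
    on = sample_uniform_on_hypersphere(1000, 3, rng)
    inside = sample_uniform_in_hypersphere(1000, 3, rng)
    ```

    With a target like the unit sphere, a test point's distance from the sphere surface after mapping gives a natural anomaly score, which is why compactness of the target makes the decision boundary clear.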

    A reinforcement learning recommender system using bi-clustering and Markov Decision Process

    Collaborative filtering (CF) recommender systems are static in nature and do not adapt well to changing user preferences. User preferences may change after interacting with a system or after buying a product. Conventional CF clustering algorithms identify only the global distribution of patterns and hidden correlations. Their inability to discover local patterns led to the popularization of bi-clustering algorithms. Bi-clustering algorithms can analyze all dataset dimensions simultaneously and consequently discover local patterns that deliver a better understanding of the underlying hidden correlations. In this paper, we model the recommendation problem as a sequential decision-making problem using Markov Decision Processes (MDP). To perform state representation for the MDP, we first convert the user-item voting matrix to a binary matrix. We then perform bi-clustering on this binary matrix to determine subsets of similar rows and columns. A bi-cluster merging algorithm is designed to merge similar and overlapping bi-clusters. These bi-clusters are then mapped to a squared grid (SG). RL is applied to this SG to determine the best policy for giving recommendations to users. The start state is determined using the Improved Triangle Similarity (ITR) similarity measure. The reward function is computed as the grid-state overlap, in terms of users and items, between the current and prospective next state. A thorough comparative analysis was conducted, encompassing a diverse array of methodologies, including RL-based, pure collaborative filtering, and clustering methods. The results demonstrate that our proposed method outperforms its competitors in terms of precision, recall, and optimal policy learning.
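    The abstract describes learning a policy over a squared grid of bi-cluster states. A minimal sketch of that setup, using tabular Q-learning on a hypothetical 3x3 grid with an invented reward (standing in for the paper's user/item overlap score), could look like:

    ```python
    import numpy as np

    # Hypothetical 3x3 squared grid of bi-cluster states; the reward for moving
    # into a state stands in for the grid-state overlap score from the paper.
    rng = np.random.default_rng(1)
    n = 3
    reward = np.zeros((n, n))
    reward[2, 2] = 1.0                               # richest bi-cluster state
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

    Q = np.zeros((n, n, len(actions)))
    alpha, gamma, eps = 0.5, 0.9, 0.3

    def step(s, a):
        """Move on the grid, clipping at the borders, and collect the reward."""
        r, c = s[0] + actions[a][0], s[1] + actions[a][1]
        r, c = min(max(r, 0), n - 1), min(max(c, 0), n - 1)
        return (r, c), reward[r, c]

    for _ in range(2000):              # episodes from a fixed start state,
        s = (0, 0)                     # standing in for the ITR-chosen start
        for _ in range(20):
            a = int(rng.integers(4)) if rng.random() < eps else int(Q[s].argmax())
            s2, r = step(s, a)
            # Standard Q-learning update
            Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
            s = s2

    greedy_first_action = int(Q[0, 0].argmax())   # learned move from the start
    ```

    In the paper's setting the reward would come from user/item overlap between the current and next grid state rather than a fixed goal cell, but the update rule and the greedy policy extraction are the same.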

    454-Pyrosequencing: A Molecular Battiscope for Freshwater Viral Ecology

    Viruses, the most abundant biological entities on the planet, are capable of infecting organisms from all three branches of life, although the majority infect bacteria, where the greatest degree of cellular diversity lies. However, the characterization and assessment of viral diversity in natural environments is only beginning to become a possibility. Through the development of a novel technique for the harvest of viral DNA and the application of 454 pyrosequencing, a snapshot of the diversity of the DNA viruses harvested from a standing pond on a cattle farm has been obtained. A high abundance of viral genotypes (785) was present within the virome. The absolute numbers of lambdoid and Shiga toxin (Stx) encoding phages detected suggested that the depth of sequencing had enabled recovery of only ca. 8% of the total virus population, numbers that agreed within less than an order of magnitude with predictions made by rarefaction analysis. The most abundant viral genotypes in the pond were bacteriophages (93.7%). The predominant viral genotypes infecting higher life forms found in association with the farm were pathogens that cause disease in cattle and humans, e.g. members of the Herpesviridae. The techniques and analysis described here provide a fresh approach to the monitoring of viral populations in the aquatic environment, with the potential to become integral to the development of risk analysis tools for monitoring the dissemination of viral agents of animal, plant and human diseases.

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset maximizing the Euclidean norm of individual information measures, and a method that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than the state-of-the-art methods compared.
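    The two feature-selection criteria named above differ in how they aggregate per-label relevance. A small sketch of that contrast, using an invented mutual-information matrix (features x labels) in place of scores computed from real data, could be:

    ```python
    import numpy as np

    def select_by_norm(scores, k):
        """Rank features by the Euclidean norm of their per-label scores."""
        return np.argsort(-np.linalg.norm(scores, axis=1))[:k]

    def select_by_geometric_mean(scores, k):
        """Rank features by the geometric mean of their per-label scores."""
        gm = np.exp(np.log(scores + 1e-12).mean(axis=1))  # epsilon avoids log(0)
        return np.argsort(-gm)[:k]

    # Hypothetical mutual-information scores: 4 features x 3 labels
    scores = np.array([
        [0.9, 0.0, 0.0],   # very relevant to one label only
        [0.4, 0.4, 0.4],   # moderately relevant to every label
        [0.1, 0.1, 0.1],   # weakly relevant to every label
        [0.6, 0.5, 0.0],   # relevant to two of three labels
    ])
    top_norm = select_by_norm(scores, 2)           # favors spiky features
    top_gm = select_by_geometric_mean(scores, 2)   # demands all-label relevance
    ```

    The toy matrix makes the trade-off visible: the Euclidean norm rewards features that are strongly informative for even a single label, while the geometric mean collapses to near zero whenever any label score does, so it prefers features relevant to all labels. This is consistent with the abstract's finding that each criterion excels in different scenarios.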