68 research outputs found

    Towards Unsupervised Deep Learning Based Anomaly Detection

    Novelty or anomaly detection is a challenging problem in many research disciplines without a general solution. In machine learning, inputs unlike the training data need to be identified. In areas where research involves taking measurements, identifying errant measurements is often necessary and occasionally vital. When monitoring the status of a system, some observations may indicate that a system failure is occurring or may occur in the near future. The challenge is to identify the anomalous measurements, which are usually sparse in comparison to the valid measurements. This paper presents a land-water classification problem framed as an anomaly detection problem to demonstrate the inability of a classifier to detect anomalies. A second problem requiring the identification of anomalous data uses a deep neural network (DNN) to perform a nonlinear regression as a method for estimating the probability that a given input is valid rather than anomalous. Autoencoders are then discussed as an alternative to the supervised classification and regression approaches, in an effort to remove the need to represent the anomalies in the training dataset.
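    The abstract proposes autoencoders so that anomalies need not appear in the training set: the model is fit on normal data only, and inputs it cannot reconstruct are flagged. A minimal sketch of that idea, using a linear autoencoder (equivalent to PCA) and invented toy data rather than the paper's DNN, might look like:

    ```python
    import numpy as np

    def fit_linear_autoencoder(X, k):
        """Fit a linear autoencoder (equivalent to PCA) on normal data only."""
        mu = X.mean(axis=0)
        # Principal directions of the centered training data
        _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
        W = Vt[:k].T          # shared encoder/decoder weights, shape (d, k)
        return mu, W

    def reconstruction_error(X, mu, W):
        """Squared error between inputs and their low-rank reconstructions."""
        Z = (X - mu) @ W      # encode
        Xh = Z @ W.T + mu     # decode
        return ((X - Xh) ** 2).sum(axis=1)

    rng = np.random.default_rng(0)
    # Toy "normal" data lying on a 2-D plane inside a 10-D space
    normal = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 10))
    mu, W = fit_linear_autoencoder(normal, k=2)

    # Threshold from normal data alone; no anomalies needed for training
    threshold = np.percentile(reconstruction_error(normal, mu, W), 99)
    anomaly = rng.normal(size=(1, 10)) * 5.0   # a point off the plane
    is_anomaly = reconstruction_error(anomaly, mu, W)[0] > threshold
    ```

    A trained nonlinear autoencoder would replace the SVD step, but the decision rule is the same: large reconstruction error signals an input unlike the training data.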

    Restricted Generative Projection for One-Class Classification and Anomaly Detection

    We present a simple framework for one-class classification and anomaly detection. The core idea is to learn a mapping that transforms the unknown distribution of the training (normal) data into a known target distribution. Crucially, the target distribution should be sufficiently simple, compact, and informative. Simplicity ensures that we can sample from the distribution easily, compactness ensures that the decision boundary between normal and abnormal data is clear and reliable, and informativeness ensures that the transformed data preserve the important information of the original data. We therefore propose using a truncated Gaussian, a uniform distribution in a hypersphere, on a hypersphere, or between hyperspheres as the target distribution. We then minimize the distance between the transformed data distribution and the target distribution while keeping the reconstruction error for the original data small enough. Comparative studies on multiple benchmark datasets verify the effectiveness of our methods in comparison to baselines.
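    A requirement of the framework above is that the target distribution be easy to sample from. A short sketch of two of the named targets (uniform on a hypersphere and uniform in a hypersphere) is shown below; the learned encoder that maps normal data onto such a target, and the distance loss used to train it, are not shown here:

    ```python
    import numpy as np

    def sample_uniform_on_hypersphere(n, d, rng):
        """Uniform on the unit (d-1)-sphere: normalize isotropic Gaussians."""
        g = rng.normal(size=(n, d))
        return g / np.linalg.norm(g, axis=1, keepdims=True)

    def sample_uniform_in_hypersphere(n, d, rng):
        """Uniform in the unit d-ball: sphere direction times radius^(1/d)."""
        directions = sample_uniform_on_hypersphere(n, d, rng)
        radii = rng.random(n) ** (1.0 / d)   # corrects for volume growth with r
        return directions * radii[:, None]

    rng = np.random.default_rng(0)
    on = sample_uniform_on_hypersphere(1000, 3, rng)
    inside = sample_uniform_in_hypersphere(1000, 3, rng)
    ```

    With a target like the unit sphere, a test point's distance from the sphere surface after mapping gives a natural anomaly score, which is why compactness of the target makes the decision boundary clear.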

    A reinforcement learning recommender system using bi-clustering and Markov Decision Process

    Collaborative filtering (CF) recommender systems are static in nature and do not adapt well to changing user preferences. User preferences may change after interacting with a system or after buying a product. Conventional CF clustering algorithms identify only the global distribution of patterns and hidden correlations. Their inability to discover local patterns led to the popularization of bi-clustering algorithms. Bi-clustering algorithms can analyze all dataset dimensions simultaneously and consequently discover local patterns that deliver a better understanding of the underlying hidden correlations. In this paper, we model the recommendation problem as a sequential decision-making problem using Markov Decision Processes (MDP). To perform state representation for the MDP, we first convert the user-item voting matrix to a binary matrix. We then perform bi-clustering on this binary matrix to determine subsets of similar rows and columns. A bi-cluster merging algorithm is designed to merge similar and overlapping bi-clusters. These bi-clusters are then mapped to a squared grid (SG). RL is applied to this SG to determine the best policy for giving recommendations to users. The start state is determined using the Improved Triangle Similarity (ITR) similarity measure. The reward function is computed as the grid-state overlap, in terms of users and items, between the current and prospective next state. A thorough comparative analysis was conducted, encompassing a diverse array of methodologies, including RL-based, pure collaborative filtering, and clustering methods. The results demonstrate that our proposed method outperforms its competitors in terms of precision, recall, and optimal policy learning.
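    The abstract describes learning a policy over a squared grid of bi-cluster states. A minimal sketch of that setup, using tabular Q-learning on a hypothetical 3x3 grid with an invented reward (standing in for the paper's user/item overlap score), could look like:

    ```python
    import numpy as np

    # Hypothetical 3x3 squared grid of bi-cluster states; the reward for moving
    # into a state stands in for the grid-state overlap score from the paper.
    rng = np.random.default_rng(1)
    n = 3
    reward = np.zeros((n, n))
    reward[2, 2] = 1.0                               # richest bi-cluster state
    actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

    Q = np.zeros((n, n, len(actions)))
    alpha, gamma, eps = 0.5, 0.9, 0.3

    def step(s, a):
        """Move on the grid, clipping at the borders, and collect the reward."""
        r, c = s[0] + actions[a][0], s[1] + actions[a][1]
        r, c = min(max(r, 0), n - 1), min(max(c, 0), n - 1)
        return (r, c), reward[r, c]

    for _ in range(2000):              # episodes from a fixed start state,
        s = (0, 0)                     # standing in for the ITR-chosen start
        for _ in range(20):
            a = int(rng.integers(4)) if rng.random() < eps else int(Q[s].argmax())
            s2, r = step(s, a)
            # Standard Q-learning update
            Q[s][a] += alpha * (r + gamma * Q[s2].max() - Q[s][a])
            s = s2

    greedy_first_action = int(Q[0, 0].argmax())   # learned move from the start
    ```

    In the paper's setting the reward would come from user/item overlap between the current and next grid state rather than a fixed goal cell, but the update rule and the greedy policy extraction are the same.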

    454-Pyrosequencing: A Molecular Battiscope for Freshwater Viral Ecology

    Viruses, the most abundant biological entities on the planet, are capable of infecting organisms from all three branches of life, although the majority infect bacteria, where the greatest degree of cellular diversity lies. However, the characterization and assessment of viral diversity in natural environments is only beginning to become a possibility. Through the development of a novel technique for the harvest of viral DNA and the application of 454 pyrosequencing, a snapshot of the diversity of the DNA viruses harvested from a standing pond on a cattle farm has been obtained. A high abundance of viral genotypes (785) was present within the virome. The absolute numbers of lambdoid and Shiga toxin (Stx) encoding phages detected suggested that the depth of sequencing had enabled recovery of only ca. 8% of the total virus population, numbers that agreed within less than an order of magnitude with predictions made by rarefaction analysis. The most abundant viral genotypes in the pond were bacteriophages (93.7%). The predominant viral genotypes infecting higher life forms found in association with the farm were pathogens that cause disease in cattle and humans, e.g. members of the Herpesviridae. The techniques and analysis described here provide a fresh approach to the monitoring of viral populations in the aquatic environment, with the potential to become integral to the development of risk analysis tools for monitoring the dissemination of viral agents of animal, plant and human diseases.

    Distributed multi-label learning on Apache Spark

    This thesis proposes a series of multi-label learning algorithms for classification and feature selection implemented on the Apache Spark distributed computing model. Five approaches for determining the optimal architecture to speed up multi-label learning methods are presented. These approaches range from local parallelization using threads to distributed computing using independent or shared memory spaces. It is shown that the optimal approach performs hundreds of times faster than the baseline method. Three distributed multi-label k nearest neighbors methods built on top of the Spark architecture are proposed: an exact iterative method that computes pair-wise distances, an approximate tree-based method that indexes the instances across multiple nodes, and an approximate locality-sensitive hashing method that builds multiple hash tables to index the data. The results indicate that the predictions of the tree-based method are on par with those of an exact method while reducing the execution times in all the scenarios. The aforementioned method is then used to evaluate the quality of a selected feature subset. The optimal adaptation for a multi-label feature selection criterion is discussed and two distributed feature selection methods for multi-label problems are proposed: a method that selects the feature subset maximizing the Euclidean norm of individual information measures, and a method that selects the subset of features maximizing the geometric mean. The results indicate that each method excels in different scenarios depending on the type of features and the number of labels. Rigorous experimental studies and statistical analyses over many multi-label metrics and datasets confirm that the proposals achieve better performance and scale better to bigger data than the state-of-the-art methods compared.
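    The two feature-selection criteria named above differ in how they aggregate per-label relevance. A small sketch of that contrast, using an invented mutual-information matrix (features x labels) in place of scores computed from real data, could be:

    ```python
    import numpy as np

    def select_by_norm(scores, k):
        """Rank features by the Euclidean norm of their per-label scores."""
        return np.argsort(-np.linalg.norm(scores, axis=1))[:k]

    def select_by_geometric_mean(scores, k):
        """Rank features by the geometric mean of their per-label scores."""
        gm = np.exp(np.log(scores + 1e-12).mean(axis=1))  # epsilon avoids log(0)
        return np.argsort(-gm)[:k]

    # Hypothetical mutual-information scores: 4 features x 3 labels
    scores = np.array([
        [0.9, 0.0, 0.0],   # very relevant to one label only
        [0.4, 0.4, 0.4],   # moderately relevant to every label
        [0.1, 0.1, 0.1],   # weakly relevant to every label
        [0.6, 0.5, 0.0],   # relevant to two of three labels
    ])
    top_norm = select_by_norm(scores, 2)           # favors spiky features
    top_gm = select_by_geometric_mean(scores, 2)   # demands all-label relevance
    ```

    The toy matrix makes the trade-off visible: the Euclidean norm rewards features that are strongly informative for even a single label, while the geometric mean collapses to near zero whenever any label score does, so it prefers features relevant to all labels. This is consistent with the abstract's finding that each criterion excels in different scenarios.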