
    Env-Aware Anomaly Detection: Ignore Style Changes, Stay True to Content!

    We introduce a formalization and benchmark for the unsupervised anomaly detection task in the distribution-shift scenario. Our work builds upon the iWildCam dataset, and, to the best of our knowledge, we are the first to propose such an approach for visual data. We empirically validate that environment-aware methods perform better in this setting than basic Empirical Risk Minimization (ERM). We then propose an extension for contrastive methods that takes the environment labels into account when generating positive samples during training, improving the ERM baseline score by 8.7%.
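    The abstract does not spell out the sampling rule; the sketch below is one plausible reading, where contrastive positives are drawn from a different environment than the anchor so the encoder learns to ignore environment style (all names are hypothetical, not the authors' code):

```python
import numpy as np

def sample_env_aware_positive(env_labels: np.ndarray, anchor_idx: int,
                              rng: np.random.Generator) -> int:
    """Pick a positive for `anchor_idx` from a *different* environment.

    In an unsupervised setup where all training samples are nominal,
    pairing across environments pushes the encoder to keep shared
    content and discard environment-specific style.
    """
    candidates = np.flatnonzero(env_labels != env_labels[anchor_idx])
    return int(rng.choice(candidates))

# Toy usage with six samples spread over three environments.
rng = np.random.default_rng(0)
env_labels = np.array([0, 0, 1, 1, 2, 2])
positive_idx = sample_env_aware_positive(env_labels, anchor_idx=0, rng=rng)
```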

    AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection

    Analyzing the distribution shift of data is a growing research direction in today's Machine Learning (ML), leading to new benchmarks that focus on providing a suitable scenario for studying the generalization properties of ML models. The existing benchmarks are focused on supervised learning, and to the best of our knowledge, there is none for unsupervised learning. Therefore, we introduce an unsupervised anomaly detection benchmark with data that shifts over time, built over Kyoto-2006+, a traffic dataset for network intrusion detection. This type of data meets the premise of shifting input distributions: it covers a large time span (10 years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, and software updates). We first highlight the non-stationary nature of the data, using a basic per-feature analysis, t-SNE, and an Optimal Transport approach for measuring the overall distribution distances between years. Next, we propose AnoShift, a protocol splitting the data into IID, NEAR, and FAR testing splits. We validate the performance degradation over time with diverse models, ranging from classical approaches to deep learning. Finally, we show that by acknowledging the distribution shift problem and properly addressing it, performance can be improved compared to classical training, which assumes independent and identically distributed data (on average, by up to 3% for our approach). Dataset and code are available at https://github.com/bit-ml/AnoShift/
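    For intuition, a minimal sketch of the split idea follows; the year boundaries below are assumptions for illustration only, and the actual protocol is defined in the linked repository:

```python
import pandas as pd

# Hypothetical year boundaries for Kyoto-2006+ style data; the real
# AnoShift protocol lives at https://github.com/bit-ml/AnoShift/.
TRAIN_YEARS = range(2006, 2011)  # fit on early years
NEAR_YEARS = range(2011, 2014)   # moderate shift from training
FAR_YEARS = range(2014, 2016)    # strong shift from training

def split_by_time(df: pd.DataFrame, year_col: str = "year"):
    """Split time-stamped data into train plus IID/NEAR/FAR test sets."""
    in_dist = df[df[year_col].isin(TRAIN_YEARS)]
    iid_test = in_dist.sample(frac=0.1, random_state=0)  # held out, same period
    train = in_dist.drop(iid_test.index)
    return train, {
        "IID": iid_test,
        "NEAR": df[df[year_col].isin(NEAR_YEARS)],
        "FAR": df[df[year_col].isin(FAR_YEARS)],
    }
```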

    Adaptive Distributed Data Storage for Context-Aware Applications, Journal of Telecommunications and Information Technology, 2013, no. 4

    Context-aware computing is a paradigm that relies on the active use of information coming from a variety of sources, ranging from smartphones to sensors. The paradigm usually leads to storing large volumes of data that need to be processed to derive higher-level context information. The paper presents a cloud-based storage layer for managing sensitive context data. Clouds are well suited to handle the storage and aggregation of context data for context-aware applications. But a Cloud platform for context-aware computing needs to cope with several requirements: high concurrent access (all data needs to be available to a potentially large number of users), mobility support (such a platform should actively use the caches on mobile devices whenever possible, but also cope with storage size limitations), real-time access guarantees (local caches should be located close to the end user whenever possible), and persistency (for traceability, a history of the context data should remain available). BlobSeer, a framework for Cloud data storage, is a strong candidate for storing context data for large-scale applications: it offers persistency, concurrency, and support for flexible storage schemas. On top of BlobSeer, the Context Aware Framework is designed as an extension that offers context-aware data management to higher-level applications and enables scalable, high-throughput access under high concurrency. At the logical level, the most important capabilities offered by the Context Aware Framework are transparency, support for mobility, real-time guarantees, and support for access based on meta-information. At the physical level, the most important capability is persistent Cloud storage.
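    As a rough illustration of the cache-plus-persistent-store design described above, here is a toy sketch; the class and method names are hypothetical, and BlobSeer's actual API is not shown:

```python
from typing import Any, Optional

class ContextStore:
    """Toy sketch of a context-data layer with a device-local cache in
    front of persistent cloud storage (the role BlobSeer plays in the
    paper). All names here are hypothetical."""

    def __init__(self, cloud_backend: dict):
        self.cloud = cloud_backend   # stands in for the persistent store
        self.local_cache: dict = {}  # mobile-device cache

    def get(self, key: str) -> Optional[Any]:
        # Serve from the local cache when possible (real-time guarantee),
        # falling back to the persistent cloud store.
        if key in self.local_cache:
            return self.local_cache[key]
        value = self.cloud.get(key)
        if value is not None:
            self.local_cache[key] = value  # populate cache for mobility
        return value

    def put(self, key: str, value: Any) -> None:
        # Write-through: persist to the cloud for traceability/history.
        self.local_cache[key] = value
        self.cloud[key] = value
```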

    Rethinking the Authorship Verification Experimental Setups

    One of the main drivers of the recent advances in authorship verification is the PAN large-scale authorship dataset. Despite generating significant progress in the field, inconsistent performance differences between the closed and open test sets have been reported. To address this, we improve the experimental setup by proposing five new public splits over the PAN dataset, specifically designed to isolate and identify biases related to the text topic and to the author's writing style. We evaluate several BERT-like baselines on these splits, showing that such models are competitive with authorship verification state-of-the-art methods. Furthermore, using explainable AI, we find that these baselines are biased towards named entities. We show that models trained without the named entities obtain better results and generalize better when tested on DarkReddit, our new dataset for authorship verification. Comment: Accepted as a short paper at the EMNLP 2022 conference. 10 pages, 5 figures, 9 tables.
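    The paper's exact preprocessing is not reproduced here; one straightforward way to train "without the named entities" is to mask them with an off-the-shelf NER model such as spaCy (the placeholder token below is an assumption):

```python
import spacy  # requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def mask_named_entities(text: str, placeholder: str = "[ENT]") -> str:
    """Replace detected named entities with a placeholder so an
    authorship model cannot shortcut through topic-specific names."""
    doc = nlp(text)
    parts, last = [], 0
    for ent in doc.ents:
        parts.append(text[last:ent.start_char])
        parts.append(placeholder)
        last = ent.end_char
    parts.append(text[last:])
    return "".join(parts)

# e.g. "Alice met Bob in Paris." -> "[ENT] met [ENT] in [ENT]."
```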

    Environment-biased Feature Ranking for Novelty Detection Robustness

    We tackle the problem of robust novelty detection, where we aim to detect novelties in terms of semantic content while being invariant to changes in other, irrelevant factors. Specifically, we operate in a setup with multiple environments, where we determine the set of features that are associated more with the environments than with the content relevant for the task. Thus, we propose a method that starts with a pretrained embedding and a multi-environment setup and ranks the features by how environment-focused they are. First, we compute a per-feature score based on the variance of the feature distribution across environments. Next, we show that by dropping the highest-scoring features, we remove spurious correlations and improve the overall performance by up to 6%, in both covariate and sub-population shift cases, on both a real benchmark and a synthetic one that we introduce for this task. Comment: The updated, long version of the paper is available at arXiv:2310.0373
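    The scoring rule is not fully specified in this abstract; the sketch below uses one plausible instantiation, the variance of per-environment feature means, and then drops the top-scoring dimensions:

```python
import numpy as np

def env_variance_scores(features: np.ndarray, env_labels: np.ndarray) -> np.ndarray:
    """Score each feature dimension by how much its per-environment
    mean varies across environments; high scores suggest the feature
    tracks the environment rather than the content."""
    envs = np.unique(env_labels)
    per_env_means = np.stack(
        [features[env_labels == e].mean(axis=0) for e in envs])
    return per_env_means.var(axis=0)  # one score per feature dimension

def drop_env_biased_features(features: np.ndarray, scores: np.ndarray,
                             k: int) -> np.ndarray:
    """Keep all but the k most environment-biased dimensions."""
    keep = np.argsort(scores)[: features.shape[1] - k]
    return features[:, keep]
```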

    Distributed Data Storage in Support for Context-Aware Applications

    Context-aware computing is a new paradigm that relies on large amounts of data collected from a variety of sources, ranging from smartphones to sensors, to automatically make smart decisions. This usually leads to large volumes of data that need to be further processed to derive higher-level context information. Clouds have recently emerged as interesting candidates to support the storage and aggregation of such data for large-scale context-aware applications. However, specific extensions to support context-aware data need to be designed in order to fully exploit the clouds' potential. In this paper we introduce such a cloud-based system, designed to support real-time processing and persistent storage of context data. The Context Aware Framework is designed as an extension of the BlobSeer storage system, building a context-aware layer on top of it to enable scalable, high-throughput access under high concurrency for big context data. Our experimental evaluation validates the transparency, mobility, and real-time guarantees provided by our approach to context-aware applications.