10 research outputs found
Env-Aware Anomaly Detection: Ignore Style Changes, Stay True to Content!
We introduce a formalization and benchmark for the unsupervised anomaly
detection task in the distribution-shift scenario. Our work builds upon the
iWildCam dataset, and, to the best of our knowledge, we are the first to
propose such an approach for visual data. We empirically validate that environment-aware methods outperform the basic Empirical Risk Minimization (ERM) baseline in such cases. We then propose an extension for generating positive samples for contrastive methods that takes the environment labels into account during training, improving the ERM baseline score by 8.7%.
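The environment-aware positive sampling described above can be sketched as follows. This is a minimal illustration under assumptions of our own: we pair each sample with another of the same (pseudo-)label drawn from a different environment when one exists; the paper's actual pairing strategy may differ, and the function name and signature are hypothetical.

```python
import numpy as np

def env_aware_positives(labels, envs, rng=None):
    """For each sample, pick a positive with the same (pseudo-)label,
    preferring one from a different environment so the contrastive loss
    pulls together same-content pairs across environments.
    Hypothetical sketch; not the paper's exact procedure."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    envs = np.asarray(envs)
    positives = np.empty(len(labels), dtype=int)
    for i in range(len(labels)):
        same = (labels == labels[i]) & (np.arange(len(labels)) != i)
        cross_env = same & (envs != envs[i])
        # fall back to same-environment positives if no cross-env one exists
        pool = np.flatnonzero(cross_env if cross_env.any() else same)
        positives[i] = rng.choice(pool) if len(pool) else i
    return positives
```

The fallback keeps the sampler usable when a label occurs in only one environment.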
AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection
Analyzing the distribution shift of data is a growing research direction in today's Machine Learning (ML), leading to new benchmarks that focus on providing suitable scenarios for studying the generalization properties of ML models. Existing benchmarks focus on supervised learning and, to the best of our knowledge, there is none for unsupervised learning. Therefore,
we introduce an unsupervised anomaly detection benchmark with data that shifts
over time, built over Kyoto-2006+, a traffic dataset for network intrusion
detection. This type of data meets the premise of a shifting input distribution: it covers a large time span ( years), with naturally occurring changes over time (e.g., users modifying their behavior patterns, or software updates). We first highlight the non-stationary nature of the data,
using a basic per-feature analysis, t-SNE, and an Optimal Transport approach
for measuring the overall distribution distances between years. Next, we
propose AnoShift, a protocol that splits the data into IID, NEAR, and FAR testing
splits. We validate the performance degradation over time with diverse models,
ranging from classical approaches to deep learning. Finally, we show that by
acknowledging the distribution shift problem and properly addressing it, performance can be improved over classical training, which assumes independent and identically distributed data (on average, by up to for
our approach). Dataset and code are available at
https://github.com/bit-ml/AnoShift/
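The abstract mentions measuring overall distribution distances between years with an Optimal Transport approach. A simplified sketch of that idea, under our own assumptions: we average per-feature 1-D Wasserstein-1 distances between a reference year and each later year. The paper's actual OT computation over the full joint distribution may differ, and `yearly_shift` is a hypothetical helper name.

```python
import numpy as np

def w1(x, y):
    # 1-D Wasserstein-1 distance for equal-sized samples:
    # the mean absolute difference between sorted values.
    return float(np.mean(np.abs(np.sort(x) - np.sort(y))))

def yearly_shift(features_by_year):
    """Distance between the first year and every later year, averaged
    over per-feature 1-D Wasserstein distances. A marginal-only
    simplification of an Optimal Transport comparison."""
    years = sorted(features_by_year)
    ref = features_by_year[years[0]]
    return {
        y: float(np.mean([w1(ref[:, j], features_by_year[y][:, j])
                          for j in range(ref.shape[1])]))
        for y in years[1:]
    }
```

Monotonically growing distances from the reference year would indicate the NEAR/FAR degradation the benchmark is built around.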
Adaptive Distributed Data Storage for Context-Aware Applications, Journal of Telecommunications and Information Technology, 2013, no. 4
Context-aware computing is a paradigm that relies on the active use of information coming from a variety of sources, ranging from smartphones to sensors. The paradigm usually leads to storing large volumes of data that need to be processed to derive higher-level context information. The paper presents a cloud-based storage layer for managing sensitive context data. Clouds are strong candidates for handling the storage and aggregation of context data for context-aware applications, but a Cloud platform for context-aware computing needs to cope with several requirements: high concurrent access (all data needs to be available to a potentially large number of users), mobility support (such a platform should actively use the caches on mobile devices whenever possible, but also cope with storage size limitations), real-time access guarantees (local caches should be located closer to the end-user whenever possible), and persistency (for traceability, a history of the context data should remain available). BlobSeer, a framework for Cloud data storage, is well suited for storing context data for large-scale applications, offering persistency, concurrency support, and flexible storage schemas. On top of BlobSeer, Context Aware Framework is designed as an extension that offers context-aware data management to higher-level applications and enables scalable, high-throughput access under high concurrency. On a logical level, the most important capabilities offered by Context Aware Framework are transparency, support for mobility, real-time guarantees, and support for access based on meta-information. On the physical layer, the most important capability is persistent Cloud storage.
Rethinking the Authorship Verification Experimental Setups
One of the main drivers of the recent advances in authorship verification is
the PAN large-scale authorship dataset. Despite generating significant progress
in the field, inconsistent performance differences between the closed and open
test sets have been reported. To this end, we improve the experimental setup by
proposing five new public splits over the PAN dataset, specifically designed to
isolate and identify biases related to the text topic and to the author's
writing style. We evaluate several BERT-like baselines on these splits, showing
that such models are competitive with authorship verification state-of-the-art
methods. Furthermore, using explainable AI, we find that these baselines are
biased towards named entities. We show that models trained without the named
entities obtain better results and generalize better when tested on DarkReddit,
our new dataset for authorship verification. Comment: Accepted as a short paper at the EMNLP 2022 conference. 10 pages, 5 figures, 9 tables.
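Training without named entities, as the abstract describes, requires masking them out of the text first. A minimal sketch under our own assumptions: in practice the entity list would come from an NER tagger, and the function name, placeholder token, and hand-made list here are all hypothetical.

```python
import re

def mask_entities(text, entities):
    """Replace every occurrence of a known named entity with a
    placeholder token so a verification model cannot latch onto it.
    Longest entities are replaced first so that multi-word names
    (e.g. "New York") are not partially masked."""
    for ent in sorted(entities, key=len, reverse=True):
        text = re.sub(re.escape(ent), "[ENT]", text)
    return text
```

The masked text can then be fed to any BERT-like verification baseline in place of the raw input.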
Environment-biased Feature Ranking for Novelty Detection Robustness
We tackle the problem of robust novelty detection, where we aim to detect
novelties in terms of semantic content while being invariant to changes in
other, irrelevant factors. Specifically, we operate in a setup with multiple environments, where we determine the set of features that are associated more with the environments than with the content relevant for the task. Thus, we propose a method that starts from a pretrained embedding and a multi-environment setup and ranks the features by their environment focus. First, we compute a per-feature score based on the variance of the feature's distribution between environments. Next, we show that by dropping the highest-scoring features, we manage to remove spurious correlations and improve the overall performance by up to 6%, both in covariate and sub-population shift cases, and on both a real and a synthetic benchmark that we introduce for this task. Comment: The updated, long version of the paper is available at arXiv:2310.0373
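The two steps described above (score per-feature environment variance, then drop the highest-scoring features) can be sketched as follows. This is an illustration under our own assumptions: we score each feature by the variance of its per-environment mean, which is one plausible reading of "feature distribution variance between environments", not necessarily the paper's exact scoring function; the function names are hypothetical.

```python
import numpy as np

def env_focus_scores(emb, envs):
    """Score each embedding feature by how much its mean varies across
    environments; high scores flag environment-focused features."""
    envs = np.asarray(envs)
    means = np.stack([emb[envs == e].mean(axis=0) for e in np.unique(envs)])
    return means.var(axis=0)  # per-feature variance of the env means

def drop_top_k(emb, scores, k):
    # Keep the (n_features - k) lowest-scoring features, preserving order.
    keep = np.sort(np.argsort(scores)[: emb.shape[1] - k])
    return emb[:, keep]
```

Downstream, a novelty detector would run on the reduced embedding, which is where the reported robustness gain would come from.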
Distributed Data Storage in Support for Context-Aware Applications
Context-aware computing is a new paradigm that relies on large amounts of data collected from a variety of sources, ranging from smartphones to sensors, to automatically make smart decisions. This usually leads to large volumes of data that need to be further processed to derive higher-level context information. Clouds have recently emerged as interesting candidates to support the storage and aggregation of such data for large-scale context-aware applications. However, specific extensions to support context-aware data need to be designed in order to fully exploit the clouds' potential. In this paper we introduce such a cloud-based system, designed to support real-time processing and persistent storage of context data. Context Aware Framework is designed as an extension of the BlobSeer storage system, building a context-aware layer on top of it to enable scalable, high-throughput access under high concurrency for big context data. Our experimental evaluation validates the transparency, mobility, and real-time guarantees provided by our approach to context-aware applications.