ADGym: Design Choices for Deep Anomaly Detection
Deep learning (DL) techniques have recently found success in anomaly
detection (AD) across various fields such as finance, medical services, and
cloud computing. However, most of the current research tends to view deep AD
algorithms as a whole, without dissecting the contributions of individual
design choices like loss functions and network architectures. This view tends
to diminish the value of preliminary steps like data preprocessing, as more
attention is given to newly designed loss functions, network architectures, and
learning paradigms. In this paper, we aim to bridge this gap by asking two key
questions: (i) Which design choices in deep AD methods are crucial for
detecting anomalies? (ii) How can we automatically select the optimal design
choices for a given AD dataset, instead of relying on generic, pre-existing
solutions? To address these questions, we introduce ADGym, a platform
specifically crafted for comprehensive evaluation and automatic selection of AD
design elements in deep methods. Our extensive experiments reveal that relying
solely on existing leading methods is not sufficient. In contrast, models
developed using ADGym significantly surpass current state-of-the-art
techniques.
Comment: NeurIPS 2023. The first three authors contributed equally. Code available at https://github.com/Minqi824/ADGy
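To make the idea of isolating individual design choices concrete, here is a minimal numpy sketch (not from the paper; all names are illustrative) of a reconstruction-based detector where one design choice, the reconstruction-error metric, is swapped between L2 and L1:

```python
import numpy as np

def reconstruction_scores(X, n_components=1, error="l2"):
    """Score anomalies by PCA reconstruction error.

    `error` is one design choice among many (loss, architecture, ...):
    "l2" sums squared residuals, "l1" sums absolute residuals.
    """
    Xc = X - X.mean(axis=0)
    # principal directions via SVD of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components].T                 # (d, k) projection basis
    residual = Xc - Xc @ V @ V.T            # reconstruction residual
    if error == "l2":
        return (residual ** 2).sum(axis=1)
    return np.abs(residual).sum(axis=1)

rng = np.random.default_rng(0)
inliers = rng.normal(size=(200, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
outlier = np.array([[0.0, 5.0]])            # off the dominant axis
X = np.vstack([inliers, outlier])

scores = reconstruction_scores(X, n_components=1, error="l2")
print(int(np.argmax(scores)))               # the planted outlier (index 200) ranks highest
```

Swapping `error="l1"` changes only one design element while holding the rest of the pipeline fixed, which is the kind of controlled comparison the abstract argues for.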
Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective
This paper takes a problem-oriented perspective and presents a comprehensive
review of transfer learning methods, both shallow and deep, for cross-dataset
visual recognition. Specifically, it categorises cross-dataset recognition
into seventeen problems based on a set of carefully chosen data and label
attributes. Such a problem-oriented taxonomy has allowed us to examine how
different transfer learning approaches tackle each problem and how well each
problem has been researched to date. This comprehensive problem-oriented review
of the advances in transfer learning has revealed not only the challenges in
transfer learning for visual recognition, but also the problems that remain
scarcely studied (eight of the seventeen). This survey not only presents an
up-to-date technical review for researchers, but also offers a systematic
approach and a reference for machine learning practitioners to categorise a real
problem and look up a possible solution accordingly.
The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management.
In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for
various applications, promoting environmental sustainability and good resource management.
However, their production remains a challenging task. Various factors contribute
to the difficulty of generating accurate, timely updated LULC maps, whether via
automatic or photo-interpreted LULC mapping. Data preprocessing, being a
crucial step for any Machine Learning task, is particularly important in the remote sensing
domain due to the overwhelming amount of raw, unlabeled data continuously gathered
from multiple remote sensing missions. However, a significant part of the state-of-the-art
focuses on scenarios with full access to labeled training data with relatively balanced class
distributions. This thesis focuses on the challenges found in automatic LULC
classification, specifically in data preprocessing. We focus on the development of novel
Active Learning (AL) and imbalanced learning techniques, to improve ML performance in
situations with limited training data and/or the presence of rare classes. We also show
that many of the contributions presented are successful not only in remote sensing problems
but also in various other multidisciplinary classification problems. The work presented
in this thesis used open access datasets to test the contributions made in imbalanced
learning and AL. All the data pulling, preprocessing and experiments are made available at
https://github.com/joaopfonseca/publications. The algorithmic implementations are made
available in the Python package ml-research at https://github.com/joaopfonseca/ml-research.
Towards Data-centric Graph Machine Learning: Review and Outlook
Data-centric AI, with its primary focus on the collection, management, and
utilization of data to drive AI models and applications, has attracted
increasing attention in recent years. In this article, we conduct an in-depth
and comprehensive review, offering a forward-looking outlook on the current
efforts in data-centric AI pertaining to graph data: the fundamental data
structure for representing and capturing intricate dependencies among massive
and diverse real-life entities. We introduce a systematic framework,
Data-centric Graph Machine Learning (DC-GML), that encompasses all stages of
the graph data lifecycle, including graph data collection, exploration,
improvement, exploitation, and maintenance. A thorough taxonomy of each stage
is presented to answer three critical graph-centric questions: (1) how to
enhance graph data availability and quality; (2) how to learn from graph data
with limited availability and low quality; (3) how to build graph MLOps systems
from the graph data-centric view. Lastly, we pinpoint the future prospects of
the DC-GML domain, providing insights to navigate its advancements and
applications.
Comment: 42 pages, 9 figures
Methods for generating and evaluating synthetic longitudinal patient data: a systematic review
The proliferation of data in recent years has led to the advancement and
utilization of various statistical and deep learning techniques, thus
expediting research and development activities. However, not all industries
have benefited equally from the surge in data availability, partly due to legal
restrictions on data usage and privacy regulations, such as in medicine. To
address this issue, various statistical disclosure and privacy-preserving
methods have been proposed, including the use of synthetic data generation.
Synthetic data are generated based on some existing data, with the aim of
replicating them as closely as possible and acting as a proxy for real
sensitive data. This paper presents a systematic review of methods for
generating and evaluating synthetic longitudinal patient data, a prevalent data
type in medicine. The review adheres to the PRISMA guidelines and covers
literature from five databases until the end of 2022. The paper describes 17
methods, ranging from traditional simulation techniques to modern deep learning
methods. The collected information includes, but is not limited to, method
type, source code availability, and approaches used to assess resemblance,
utility, and privacy. Furthermore, the paper discusses practical guidelines and
key considerations for developing synthetic longitudinal data generation
methods.
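Among the traditional simulation techniques the review covers, one of the simplest can be sketched in a few lines: fit a first-order autoregressive model to real longitudinal measurements and sample synthetic patient trajectories from it. This is a deliberately minimal illustration (all names and parameters below are hypothetical, not drawn from any reviewed method):

```python
import numpy as np

def fit_ar1(trajectories):
    """Estimate AR(1) parameters x_t = c + phi * x_{t-1} + eps from real series."""
    prev = np.concatenate([t[:-1] for t in trajectories])
    curr = np.concatenate([t[1:] for t in trajectories])
    phi, c = np.polyfit(prev, curr, 1)      # least-squares slope and intercept
    sigma = np.std(curr - (c + phi * prev)) # residual noise scale
    return c, phi, sigma

def sample_trajectory(c, phi, sigma, x0, steps, rng):
    """Generate one synthetic patient trajectory from the fitted AR(1)."""
    x = [x0]
    for _ in range(steps):
        x.append(c + phi * x[-1] + rng.normal(scale=sigma))
    return np.array(x)

rng = np.random.default_rng(1)
# toy "real" data: 50 patients with 20 visits each, true phi = 0.8
real = []
for _ in range(50):
    t = [rng.normal()]
    for _ in range(19):
        t.append(0.8 * t[-1] + rng.normal(scale=0.5))
    real.append(np.array(t))

c, phi, sigma = fit_ar1(real)
synthetic = sample_trajectory(c, phi, sigma, x0=0.0, steps=19, rng=rng)
print(round(phi, 2), synthetic.shape)
```

The fitted `phi` recovers the temporal dependence of the real cohort; evaluating how closely such samples resemble real data (and what they leak about it) is exactly the resemblance/utility/privacy assessment the review catalogues.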
A Comprehensive Survey on Rare Event Prediction
Rare event prediction involves identifying and forecasting events with a low
probability using machine learning and data analysis. Due to the imbalanced
data distributions, where the frequency of common events vastly outweighs that
of rare events, it requires using specialized methods within each step of the
machine learning pipeline, i.e., from data processing to algorithms to
evaluation protocols. Predicting the occurrences of rare events is important
for real-world applications, such as Industry 4.0, and is an active research
area in statistical and machine learning. This paper comprehensively reviews
the current approaches for rare event prediction along four dimensions: rare
event data, data processing, algorithmic approaches, and evaluation approaches.
Specifically, we consider 73 datasets from different modalities (i.e.,
numerical, image, text, and audio), four major categories of data processing,
five major algorithmic groupings, and two broader evaluation approaches. This
paper aims to identify gaps in the current literature and highlight the
challenges of predicting rare events. It also suggests potential research
directions, which can help guide practitioners and researchers.
Comment: 44 pages
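The data-processing dimension above includes resampling methods that counteract the imbalance. As a minimal illustration (not any specific surveyed method), random oversampling duplicates rare-event rows until the classes are balanced:

```python
import numpy as np

def random_oversample(X, y, rng):
    """Duplicate minority-class rows until both classes are the same size."""
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=deficit, replace=True)  # sample with replacement
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:10] = 1                                  # 1% rare event rate
Xb, yb = random_oversample(X, y, rng)
print(np.bincount(yb))                      # prints [990 990]: classes balanced
```

Resampling only addresses the data-processing step; as the survey stresses, the algorithmic and evaluation steps (e.g. metrics robust to 1% positive rates) need matching adjustments.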
Label-efficient Time Series Representation Learning: A Review
The scarcity of labeled data is one of the main challenges of applying deep
learning models on time series data in the real world. Therefore, several
approaches, e.g., transfer learning, self-supervised learning, and
semi-supervised learning, have been recently developed to promote the learning
capability of deep learning models from the limited time series labels. In this
survey, for the first time, we provide a novel taxonomy to categorize existing
approaches that address the scarcity of labeled data problem in time series
data based on their dependency on external data sources. Moreover, we present a
review of the recent advances in each approach, summarize the limitations of
current works, and provide future directions that could yield better progress in
the field.
Comment: Under Review
Improving Active Learning Performance through the Use of Data Augmentation
Fonseca, J., & Bacao, F. (2023). Improving Active Learning Performance through the Use of Data Augmentation. International Journal of Intelligent Systems, 2023, 1-17. https://doi.org/10.1155/2023/7941878
Funding: This research was supported by three research grants of the Portuguese Foundation for Science and Technology ("Fundação para a Ciência e a Tecnologia"): SFRH/BD/151473/2021 - MIT Portugal PhD Grant; DSAIPA/DS/0116/2019, and PCIF/SSI/0102/2017.
Active learning (AL) is a well-known technique to optimize data usage in training, through the interactive selection of unlabeled observations, out of a large pool of unlabeled data, to be labeled by a supervisor. Its focus is to find the unlabeled observations that, once labeled, will maximize the informativeness of the training dataset, therefore reducing data-related costs. The literature describes several methods to improve the effectiveness of this process. Nonetheless, there is a paucity of research developed around the application of artificial data sources in AL, especially outside image classification or NLP. This paper proposes a new AL framework, which relies on the effective use of artificial data. It may be used with any classifier, generation mechanism, and data type and can be integrated with multiple other state-of-the-art AL contributions. This combination is expected to increase the ML classifier's performance and reduce both the supervisor's involvement and the amount of required labeled data at the expense of a marginal increase in computational time. The proposed method introduces a hyperparameter optimization component to improve the generation of artificial instances during the AL process, as well as an uncertainty-based data generation mechanism. We compare the proposed method to the standard framework and an oversampling-based active learning method for more informed data generation in an AL context.
The models' performance was tested using four different classifiers, two AL-specific performance metrics, and three classification performance metrics over 15 different datasets. We demonstrated that the proposed framework, using data augmentation, significantly improved the performance of AL, both in terms of classification performance and data selection efficiency (all the codes and preprocessed data developed for this study are available at https://github.com/joaopfonseca/publications/).
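The uncertainty-based selection at the core of pool-based AL can be sketched as follows. This is a generic illustration, not the paper's exact method: the nearest-centroid classifier and all names are placeholders, and uncertainty is taken as the margin between distances to the two class centroids:

```python
import numpy as np

def centroid_margin(X, X_labeled, y_labeled):
    """Nearest-centroid classifier: return a per-point distance margin.

    A small margin (similar distance to both class centroids) means the
    current model is uncertain about that point's label.
    """
    c0 = X_labeled[y_labeled == 0].mean(axis=0)
    c1 = X_labeled[y_labeled == 1].mean(axis=0)
    d0 = np.linalg.norm(X - c0, axis=1)
    d1 = np.linalg.norm(X - c1, axis=1)
    return np.abs(d0 - d1)

def active_learning_step(X_pool, X_labeled, y_labeled):
    """Pick the pool index whose label the current model is least sure of."""
    margins = centroid_margin(X_pool, X_labeled, y_labeled)
    return int(np.argmin(margins))          # smallest margin = most informative

X_labeled = np.array([[0.0, 0.0], [4.0, 0.0]])
y_labeled = np.array([0, 1])
X_pool = np.array([[0.2, 0.1], [2.0, 0.0], [3.9, -0.1]])
query = active_learning_step(X_pool, X_labeled, y_labeled)
print(query)                                # 1: the midpoint between classes is queried
```

The paper's contribution builds on this loop by generating artificial instances in the uncertain region rather than only querying existing pool points.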
Machine Learning for Synthetic Data Generation: A Review
Data plays a crucial role in machine learning. However, in real-world
applications, there are several problems with data, e.g., data are of low
quality; a limited number of data points leads to under-fitting of the machine
learning model; it is hard to access the data due to privacy, safety and
regulatory concerns. Synthetic data generation offers a promising new avenue,
as it can be shared and used in ways that real-world data cannot. This paper
systematically reviews the existing works that leverage machine learning models
for synthetic data generation. Specifically, we discuss the synthetic data
generation works from several perspectives: (i) applications, including
computer vision, speech, natural language, healthcare, and business; (ii)
machine learning methods, particularly neural network architectures and deep
generative models; (iii) privacy and fairness issues. In addition, we identify
the challenges and opportunities in this emerging field and suggest future
research directions.
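As a minimal illustration of model-based synthetic data generation (a deliberately simple Gaussian model, far simpler than the deep generative models surveyed; all names below are illustrative): fit a mean and covariance to real tabular data, then sample new rows that preserve its correlation structure:

```python
import numpy as np

def fit_gaussian(X):
    """Fit the mean and covariance of the real data (the 'generative model')."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def sample_synthetic(mean, cov, n, rng):
    """Draw synthetic rows from the fitted Gaussian."""
    return rng.multivariate_normal(mean, cov, size=n)

rng = np.random.default_rng(0)
# toy "real" dataset: two correlated features
z = rng.normal(size=(5000, 1))
real = np.hstack([z, 0.9 * z + 0.3 * rng.normal(size=(5000, 1))])

mean, cov = fit_gaussian(real)
synthetic = sample_synthetic(mean, cov, n=5000, rng=rng)

# the correlation structure carries over from real to synthetic rows
print(round(float(np.corrcoef(real.T)[0, 1]), 2),
      round(float(np.corrcoef(synthetic.T)[0, 1]), 2))
```

Because only the fitted summary statistics are released, not the rows themselves, such samples can be shared where the real data cannot, though (as the review discusses) privacy guarantees still require careful analysis.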